1  Introduction 


In  many  domains,  it  is  important  to  acquire  information  about  a  relation  between  two  sets. 
For  example,  one  may  wish  to  learn  a  “has-part”  relation  between  a  set  of  animals  and  a  set 
of  attributes.  We  are  motivated  by  the  problem  of  designing  a  prediction  algorithm  to  learn 
such  a  binary  relation  where  the  learner  has  limited  prior  information  about  the  predicate 
forming  the  relation.  While  one  could  model  such  problems  as  concept  learning,  they  are 
fundamentally  different  problems.  In  concept  learning  there  is  a  single  set  of  objects  and  the 
learner’s  task  is  to  classify  these  objects,  whereas  in  learning  a  binary  relation  there  are  two 
sets  of  objects  and  the  learner’s  task  is  to  learn  the  predicate  relating  the  two  sets.  Observe 
that  the  problem  of  learning  a  binary  relation  can  be  viewed  as  a  concept  learning  problem 
by  letting  the  instances  be  all  ordered  pairs  of  objects  from  the  two  sets.  However,  the  ways 
in  which  the  problem  may  be  structured  are  quite  different  when  the  true  task  is  to  learn 
a  binary  relation  as  opposed  to  a  classification  rule.  That  is,  instead  of  a  rule  that  defines 
which  objects  belong  to  the  target  concept,  the  predicate  defines  a  relationship  between  pairs 
of objects.

A  binary  relation  is  defined  between  two  sets  of  objects.  Throughout  this  paper,  we 
assume  that  one  set  has  cardinality  n  and  the  other  has  cardinality  m.  We  also  assume  that 
for  all  possible  pairings  of  objects,  the  predicate  relating  the  two  sets  of  variables  is  either  true 
(1)  or  false  (0).  Before  defining  a  prediction  algorithm,  we  first  discuss  our  representation 
of  a  binary  relation.  Throughout  this  paper,  we  represent  the  relation  as  an  n  x  m  binary 
matrix,  where  an  entry  contains  the  value  of  the  predicate  for  the  corresponding  elements. 
Since  the  predicate  is  binary-valued,  all  entries  in  this  matrix  are  either  0  (false)  or  1  (true). 
The  two  dimensional  structure  arises  from  the  fact  that  we  are  learning  a  binary  relation. 

For  the  sake  of  comparison,  we  now  briefly  mention  other  possible  representations.  One 
could  represent  the  relation  as  a  table  with  two  columns,  where  each  entry  in  the  first  column 
is  an  item  from  the  first  set  and  each  entry  in  the  second  column  is  an  item  from  the  second 
set.  The  rows  of  the  table  consist  of  the  subset  of  the  potential  nm  pairings  for  which  the 
predicate  is  true.  One  could  also  represent  the  relation  as  a  bipartite  graph  with  n  vertices 
in  one  vertex  set  and  m  vertices  in  the  other  set.  An  edge  is  placed  between  two  vertices 
exactly  when  the  predicate  is  true  for  corresponding  items. 
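The three representations above carry the same information, which the following minimal Python sketch makes concrete. The function names and the 0-based indexing are assumptions made for this illustration only; they do not come from the paper.

```python
def matrix_to_pairs(M):
    """Two-column table view: the set of (row, column) pairs where the predicate is true."""
    return {(i, j) for i, row in enumerate(M) for j, v in enumerate(row) if v == 1}


def pairs_to_matrix(pairs, n, m):
    """Rebuild the n x m binary matrix from the table of true pairs."""
    M = [[0] * m for _ in range(n)]
    for i, j in pairs:
        M[i][j] = 1
    return M


def matrix_to_bipartite(M):
    """Bipartite-graph view: each row vertex maps to the column vertices it is adjacent to."""
    return {i: {j for j, v in enumerate(row) if v == 1} for i, row in enumerate(M)}


# A hypothetical relation between 2 objects (rows) and 3 attributes (columns).
example = [[1, 0, 1],
           [0, 0, 1]]
```

The matrix view is the one used in the rest of the paper because the restriction studied later (few distinct row types) is stated directly in terms of rows.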

Having  introduced  our  method  for  representing  the  problem,  we  now  informally  discuss 
the basic learning scenario. The learner is repeatedly given a pair of elements, one from
each  set,  and  asked  to  predict  the  corresponding  matrix  entry.  After  making  its  prediction, 
the  learner  is  told  the  correct  value  of  the  matrix  entry.  The  learner  wishes  to  minimize  the 
number  of  incorrect  predictions  it  makes.  Since  we  assume  that  the  learner  must  eventually 
make  a  prediction  for  each  matrix  entry,  the  number  of  incorrect  predictions  depends  on  the 
size  of  the  matrix. 

Unlike  problems  typically  studied  where  the  natural  measure  of  the  size  of  the  learner’s 
problem  is  the  size  of  an  instance  (or  example),  for  this  problem  it  is  the  size  of  the  matrix. 
Such  concept  classes  with  polynomial-sized  instance  spaces  are  uninteresting  in  Valiant’s  [23] 
probably approximately correct (PAC) model of learning. In this model, instances are chosen
randomly from an arbitrary unknown probability distribution on the instance space. A
concept class is PAC-learnable if the learner, after seeing a number of instances that is polynomial
in the problem size, can output a hypothesis that is correct on all but an arbitrarily
small fraction of the instances with high probability. For concepts whose instance space has
cardinality polynomial in the problem size, by asking to see enough instances the learner can
see almost all of the probability weight of the instance space. Thus it is not hard to show
that  these  concept  classes  are  trivially  PAC-learnable.  One  goal  of  our  research  is  to  build  a 
framework  for  studying  such  problems. 

To  study  learning  algorithms  for  these  concept  classes  we  extend  the  basic  mistake  bound 
model  [10,  11,  15]  to  the  cases  that  a  helpful  teacher  or  the  learner  selects  the  query  sequence, 
in addition to the cases where instances are chosen by an adversary or according to a probability
distribution on the instance space. Previously, helpful teachers have been used to
provide  counterexamples  to  conjectured  concepts  [1],  or  to  break  up  the  concept  into  smaller 
sub-concepts  [19].  In  our  framework,  the  teacher  only  selects  the  presentation  order  for  the 
instances. 

If  the  learner  is  to  have  any  hope  of  doing  better  than  random  guessing,  there  must  be 
some  structure  in  the  relation.  Furthermore,  since  there  are  so  many  ways  to  structure  a 
binary  relation,  we  give  the  learner  some  prior  knowledge  about  the  nature  of  this  structure. 
Not  surprisingly,  the  learning  task  depends  greatly  on  the  prior  knowledge  provided.  One 
way  to  impose  structure  is  to  restrict  one  set  of  objects  to  have  relatively  few  “types.”  For 
example,  a  circus  may  contain  many  animals,  but  only  a  few  different  species.  In  the  first 
part  of  this  paper  we  study  the  case  where  the  learner  has  “a  priori”  knowledge  that  there  are 
a limited number of object types. Namely, we restrict the matrix representing the relation
to  have  at  most  k  distinct  row  types.  (Two  rows  are  of  the  same  type  if  they  agree  in  all 
columns.)  We  define  a  k-binary-relation  to  be  a  binary  relation  for  which  the  corresponding 
matrix  has  at  most  k  row  types.  This  restriction  is  satisfied  whenever  there  are  only  k  types 
of  objects  in  the  set  of  n  objects  being  considered  in  the  relation.  The  learner  receives  no 
other  knowledge  about  the  predicate  forming  the  relation.  With  this  restriction,  we  prove 
that any prediction algorithm makes at least $(1 - \beta)km + n\lfloor \lg(\beta k) \rfloor - (1 - \beta)k\lfloor \lg(\beta k) \rfloor$
mistakes in the worst case for all $0 < \beta < 1$ against any query sequence. So for $\beta = 1/2$, we
get a lower bound of $\frac{km}{2} + (n - \frac{k}{2})\lfloor \lg k - 1 \rfloor$ on the number of mistakes made by any prediction
algorithm.  If  computational  efficiency  is  not  a  concern,  the  halving  algorithm  [3,  15]  makes 
at most $km + (n - k) \lg k$ mistakes against any query sequence. (The halving algorithm
predicts  according  to  the  majority  of  the  feasible  relations  (or  concepts),  and  thus  each 
mistake  halves  the  number  of  remaining  relations.) 

We present an efficient algorithm making at most $km + (n - k)\lceil \lg k \rceil$ mistakes in the case
that the learner chooses the query sequence. We prove a tight mistake bound of $km + (n - k)(k - 1)$
in the case that the helpful teacher selects the query sequence.¹ When the adversary
selects the query sequence, we present an efficient algorithm for $k = 2$ that makes at most
$2m + n - 2$ mistakes, and for arbitrary $k$ we present an efficient algorithm making at most
$km + n\sqrt{(k - 1)m}$ mistakes. We prove any algorithm makes at least $km + (n - k)\lfloor \lg k \rfloor$ mistakes
in the case that an adversary selects the query sequence, and use the existence of projective
geometries to improve this lower bound to $\Omega(km + (n - k)\lfloor \lg k \rfloor + \min\{n\sqrt{m}, m\sqrt{n}\})$ for a large
class of algorithms. Finally, we describe a technique to simplify the proof of expected mistake
bounds when the query sequence is chosen at random, and use it to prove an $O(km + nk\sqrt{H})$
expected mistake bound for a simple algorithm. (Here $H$ is the maximum Hamming distance
between any two rows.)

Another possibility for known structure is the problem of learning a binary relation on
a set where the predicate induces a total order on the set. (For example, the predicate
may be “<”.) In the second half of this paper we study the case in which the learner
has “a priori” knowledge that the relation forms a total order. There we see that the
halving algorithm [3, 15] yields a good mistake bound against any query sequence. This
motivates a second goal of this research, to develop efficient implementations of the halving
algorithm. We uncover an interesting application of randomized approximation schemes to
computational learning theory. Namely, we describe a technique that uses a fully polynomial
randomized approximation scheme (fpras) to implement a randomized version of the halving
algorithm. We apply this technique, using a fpras due to Dyer, Frieze, and Kannan [7] and
Matthews [18] for counting the number of linear extensions of a partial order, to obtain a
polynomial prediction algorithm that makes at most $n \lg n + (\lg e) \lg n$ mistakes with very
high probability against an adversary-selected query sequence. The small probability of
making “too many” mistakes is determined by the coin flips of the learning algorithm and
not by the query sequence selected by the adversary. We contrast this result with an $n - 1$
mistake bound when the learner selects the query sequence [25], and an $n - 1$ mistake bound
when a teacher selects the query sequence.

¹ The mistake bound is a worst case mistake bound taken over all “consistent” learners. See Section 2 for
formal definitions.

Finally,  we  discuss  the  relationship  between  counting  schemes  and  the  halving  algorithm. 
A  majority  algorithm  takes  as  input  a  description  of  a  set  of  objects,  some  of  which  are 
distinguished,  and  outputs  a  1  if  and  only  if  at  least  half  of  the  elements  in  the  set  are 
distinguished.  On  the  other  hand,  a  counting  algorithm  outputs  the  number  of  distinguished 
elements  in  the  set.  We  present  some  preliminary  results  on  using  a  majority  algorithm  to 
implement  a  counting  algorithm. 

The  remainder  of  this  paper  is  organized  as  follows.  In  the  next  section  we  formally 
introduce  the  basic  problem,  the  learning  scenario  and  the  extended  mistake  bound  model. 
In Section 3 we present our results for learning a k-binary-relation. After giving a motivating
example in Section 3.1, in Section 3.2 we present some general mistake bounds. Then
in  Section  3.3  we  consider  when  the  learner  selects  the  query  sequence.  Next  we  consider 
when  a  helpful  teacher  selects  the  query  sequence.  In  Section  3.5  we  consider  when  an 
adversary  selects  the  query  sequence.  Finally,  in  Section  3.6  we  consider  when  the  query 
sequence  is  selected  at  random.  In  Section  4  we  turn  our  attention  to  the  problem  of  learning 
a  total  order.  We  begin  by  discussing  the  relationship  between  the  halving  algorithm  and 
approximate  counting  schemes  in  Section  4.1.  In  particular,  we  describe  how  a  fpras  can  be 
used  to  implement  an  approximate  halving  algorithm.  Then  in  Section  4.2  we  present  our 
results  on  learning  a  total  order.  Finally,  in  Section  4.3,  we  describe  techniques  to  convert 
majority  algorithms  to  counting  algorithms.  We  conclude  with  a  summary  of  our  results  and 
discussion  of  open  problems. 


2  Learning  Scenario  and  Mistake  Bound  Model 


In  this  section  we  give  formal  definitions  and  discuss  the  learning  scenario  used  in  this 
paper.  To  be  consistent  with  the  literature,  we  shall  discuss  these  models  in  terms  of  concept 
learning.  As  we  have  mentioned,  the  problem  of  learning  a  binary  relation  can  be  viewed  in 
this framework by letting the instances be all pairs of objects, one from each of the two sets.

A  concept  c  is  a  Boolean  function  on  some  domain  of  instances.  A  concept  class  C  is  a 
collection  of  concepts.  The  learner’s  goal  is  to  infer  some  unknown  target  concept  chosen 
from some known concept class. Often $C$ is decomposed into subclasses $C_n$ according to
some natural dimension measure $n$. That is, for each $n \ge 1$, let $X_n$ denote a finite learning
domain. Let $X = \bigcup_{n \ge 1} X_n$, and let $x \in X$ denote an instance. To illustrate these definitions,
we consider the concept class of monomials. (A monomial is a conjunction of literals, where
each literal is either some Boolean variable or its negation.) For this concept class $n$ is just
the number of variables. Thus $|X_n| = 2^n$, where each $x \in X_n$ is chosen from $\{0, 1\}^n$ and
represents the assignment for each variable. For each $n \ge 1$, let $C_n \subseteq 2^{X_n}$ be a family of
concepts on $X_n$. Let $C = \bigcup_{n \ge 1} C_n$ denote a concept class over $X$. For example, if $C_n$ contains
monomials over $n$ variables, then $C$ is the class of all monomials. Given any concept $c \in C_n$,
we say that $x \in c$ is a positive instance of $c$, and $x \in X_n - c$ is a negative instance of $c$.
In our example, the target concept for the class of monomials over five variables might be
$x_1 \bar{x}_4 x_5$. Then the instance “10001” is a positive instance and “00001” is a negative instance.
Finally, the hypothesis space of algorithm $A$ is simply the set of all hypotheses (or rules) $h$
that $A$ may output. (A hypothesis for $C_n$ must make a prediction for each $x \in X_n$.)
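The monomial example can be checked mechanically. The sketch below assumes the target is the conjunction of x1, the negation of x4, and x5, which is the reading under which “10001” is positive and “00001” is negative; the helper name `make_monomial` and the dictionary encoding of literals are illustrative choices, not notation from the paper.

```python
from itertools import product


def make_monomial(literals):
    """Build a concept from a dict {variable index (1-based): required bit}."""
    def concept(x):
        # x is a bit string such as "10001"; every listed literal must be satisfied.
        return all(int(x[i - 1]) == bit for i, bit in literals.items())
    return concept


# Assumed target over five variables: x1 AND (NOT x4) AND x5.
target = make_monomial({1: 1, 4: 0, 5: 1})

# The positive instances are exactly the assignments in X_5 that satisfy the target.
positives = ["".join(bits) for bits in product("01", repeat=5)
             if target("".join(bits))]
```

Since three of the five variables are fixed by the monomial, four of the 32 instances in the domain are positive.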

A  prediction  algorithm  for  C  is  an  algorithm  that  runs  under  the  following  scenario. 
A learning session consists of a set of trials. In each trial, the learner is given an
unlabeled instance $x \in X_n$. The learner uses its current hypothesis to predict if $x$ is a
positive or negative instance of the target concept $c \in C_n$ and then the learner is told the
correct  classification  of  x.  If  the  prediction  was  incorrect,  the  learner  has  made  a  mistake. 
Note  that  in  this  model  there  is  no  training  phase.  Instead,  the  learner  receives  unlabeled 
instances  throughout  the  entire  learning  session.  However,  after  each  prediction  the  learner 
“discovers”  the  correct  classification.  This  feedback  can  then  be  used  by  the  learner  to 
improve its hypothesis. A learner is consistent if, on every trial, there is some concept in $C_n$
that agrees both with the learner’s prediction and with all the labeled instances observed on
preceding trials.
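The trial structure just described (predict, receive the correct label, update) can be sketched as a short loop. The learner interface with `predict` and `update` methods, and the trivial `AlwaysZeroLearner`, are assumptions introduced only for this illustration.

```python
def run_session(learner, target, query_sequence):
    """Run one learning session and return the number of mistakes.

    target maps an instance to its correct label (0 or 1);
    query_sequence lists the instances in presentation order.
    """
    mistakes = 0
    for x in query_sequence:
        guess = learner.predict(x)   # prediction from the current hypothesis
        truth = target(x)            # feedback revealed after the prediction
        if guess != truth:
            mistakes += 1
        learner.update(x, truth)     # the learner may now revise its hypothesis
    return mistakes


class AlwaysZeroLearner:
    """A hypothetical baseline learner that ignores feedback and always predicts 0."""

    def predict(self, x):
        return 0

    def update(self, x, truth):
        pass


# Example: a 2 x 2 relation with two true entries, queried in row-major order.
target_matrix = [[1, 0],
                 [0, 1]]
queries = [(i, j) for i in range(2) for j in range(2)]
session_mistakes = run_session(
    AlwaysZeroLearner(), lambda x: target_matrix[x[0]][x[1]], queries)
```

There is no separate training phase in this loop: every instance is first predicted and only then labeled.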

The number of mistakes made by the learner depends on the sequence of instances presented.
We extend the mistake bound model to include several methods for the selection
of instances. A query sequence is a permutation $\pi = \langle x_1, x_2, \ldots, x_{|X_n|} \rangle$ of $X_n$ where $x_t$ is
the instance presented to the learner at the $t$th trial. We call the agent selecting the query
sequence the director. We consider the following directors:

•  Learner - The learner chooses $\pi$. To select $x_t$, the learner may use time polynomial
in n and s, and all information obtained in the first $t - 1$ trials. In this case we say
that the learner is self-directed.

•  Helpful Teacher - A teacher who knows the target concept and wants to minimize
the learner’s mistakes chooses $\pi$. To select $x_t$, the teacher uses knowledge of the target
concept, $x_1, \ldots, x_{t-1}$, and the learner’s predictions on $x_1, \ldots, x_{t-1}$. To avoid allowing
the learner and teacher to have a coordinated strategy, in this scenario we consider the
worst case mistake bound over all consistent learners. In this case we say the learner
is teacher-directed.

•  Adversary - The adversary who selected the target concept chooses $\pi$. This adversary,
who tries to maximize the learner’s mistakes, knows the learner’s algorithm and
has unlimited computing power. In this case we say the learner is adversary-directed.

•  Random - In this model, $\pi$ is selected randomly according to a uniform probability
distribution on the permutations of $X_n$. Here the number of mistakes made by the
learner for some target concept $c$ in $C_n$ is defined to be the expected number of mistakes
over all possible query sequences. In this case we say the learner is randomly-directed.

We  consider  how  a  prediction  algorithm’s  performance  depends  on  the  director.  Let 
$\mathcal{A}_Z(C)$ denote the set of prediction algorithms for learning concept class $C$ with director $Z$.
For prediction algorithm $A \in \mathcal{A}_Z(C)$, we define the mistake bound $MB(A, C_n)$ to be the worst
case number of mistakes made by $A$ for any target concept in $C_n$ under any query sequence
provided by $Z$. (When $Z$ = adversary, $MB(A, C_n) = M_A(C_n)$ as defined by Littlestone [15].)
We  say  that  A  is  a  polynomial  prediction  algorithm  if  A  makes  each  prediction  in  time 
polynomial  in  n. 

3  Learning  Binary  Relations 

In this section we apply the learning scenario of the extended mistake bound model to
the concept class $C$ of k-binary-relations. For this concept class the dimension measure is
$nm$, the number of entries in the corresponding matrix, and $X_{nm} = \{1, \ldots, n\} \times \{1, \ldots, m\}$.
An instance $(i, j)$ is in the target concept $c \in C_{nm}$ if and only if the matrix entry in row
$i$ and column $j$ is a 1. So in each trial the learner is given an instance $x$ from
$X_{nm}$ and asked to predict the corresponding matrix entry. After making its prediction, the
learner  is  told  the  correct  value  of  the  matrix  entry.  The  learner  wishes  to  minimize  the 
number  of  incorrect  predictions  it  makes.  Since  we  assume  that  the  learner  must  eventually 
make  a  prediction  for  each  matrix  entry,  the  number  of  incorrect  predictions  depends  on  the 
size  of  the  matrix. 

We  begin  this  section  with  a  motivating  example  from  the  domain  of  allergy  testing.  We 
use  this  example  to  motivate  both  the  restriction  that  the  matrix  has  k  row  types  and  the 
use  of  the  extended  mistake  bound  model.  We  then  present  general  upper  and  lower  bounds 
on  the  number  of  mistakes  made  by  the  learner  regardless  of  the  director.  Finally,  we  study 
the complexity of learning a k-binary-relation under each director.

3.1  Motivation:  Allergist  Example 

In  this  section  we  use  the  following  example  taken  from  the  domain  of  allergy  testing  to 
motivate the problem of learning a k-binary-relation.

Consider  an  allergist  with  a  set  of  patients  to  be  tested  for  a  given  set  of  allergens.  Each 
patient is either highly allergic, mildly allergic, or not allergic to any given allergen. The
allergist may use either an epicutaneous (scratch) test in which the patient is given a fairly
low dose of the allergen, or an intradermal (under the skin) test in which the patient is given
a larger dose of the allergen. The patient’s reaction to the test is classified as strong positive,
weak  positive  or  negative.  Figure  1  describes  the  reaction  that  occurs  for  each  combination 
of  allergy  level  and  dosage  level.  Finally,  we  assume  a  strong  positive  reaction  is  extremely 
uncomfortable  to  the  patient,  but  not  dangerous. 

                    Epicutaneous      Intradermal
                    (Scratch)         (Under the Skin)
Not Allergic        negative          negative
Mildly Allergic     negative          weak positive
Highly Allergic     weak positive     strong positive

Figure 1: Summary of testing reactions for allergy testing example.

What options does the allergist have in testing a patient for a given allergen? He could
just perform the intradermal test (option 0). Another option (option 1) is to perform an
epicutaneous test, and if it is not conclusive, then perform an intradermal test. (See Figure 2
for decision trees describing these two testing options.) Which testing option is best? If the
patient  has  no  allergy  or  a  mild  allergy  to  the  given  allergen,  then  testing  option  0  is  best, 
since  the  patient  need  not  return  for  the  second  test.  However,  if  the  patient  is  highly  allergic 
to  the  given  allergen,  then  testing  option  1  is  best,  since  the  patient  does  not  experience  a 
bad  reaction.  We  assume  the  inconvenience  of  going  to  the  allergist  twice  is  approximately 
the same as having a bad reaction. That is, the allergist has no preference to err in a
particular  direction.  While  the  allergist’s  final  goal  is  to  determine  each  patient’s  allergies, 
we  consider  the  problem  of  learning  the  optimal  testing  option  for  each  combination  of 
patient  and  allergen. 

The  allergist  interacts  with  the  environment  as  follows.  In  each  “trial”  the  allergist  is 
asked  to  predict  the  best  testing  option  for  a  given  patient/allergen  pair.  He  is  then  told 
the  testing  results,  thus  learning  whether  the  patient  is  not  allergic,  mildly  allergic  or  highly 
allergic  to  the  given  allergen.  In  other  words,  the  allergist  receives  feedback  as  to  the  correct 
testing  option.  Note  that  we  make  no  restrictions  on  how  the  hypothesis  is  represented 
as  long  as  it  can  be  evaluated  in  polynomial  time.  In  other  words,  all  we  require  is  that 
given  any  patient/allergen  pair,  the  allergist  decides  which  test  to  perform  in  a  “reasonable” 
amount  of  time. 

How can the allergist possibly predict a patient’s allergies? If the allergies of the patients
are completely “random,” then there is not much hope. What a priori knowledge does
the allergist have? He knows that people often have exactly the same allergies. So there
is a set of “allergy types” that occur often. (We do not assume that the allergist has a
priori knowledge of the actual allergy types.) This knowledge can help guide the allergist’s
predictions.

Figure 2: The testing options available to the allergist.

Having  specified  the  problem  we  discuss  our  choice  of  using  the  extended  mistake  bound 
model  to  evaluate  learning  algorithms  for  this  problem.  First  of  all,  observe  that  we  want 
an on-line model. There is no training phase here; the allergist wants to predict the correct
testing option for each patient/allergen pair. Also we expect that the allergist has time to
test each patient for each allergen; that is, the instance space is polynomial-sized. Thus as
discussed  in  Section  1  the  distribution-free  model  is  not  appropriate. 

How should we judge the performance of the learning algorithm? For each wrong prediction
made, a patient is inconvenienced with making a second trip or having a bad reaction.
Since the learner wants to give all patients the best possible service, he strives to minimize
the number of incorrect predictions made. Thus we want to use the absolute mistake bound
success criterion. Namely, we judge the performance of the learning algorithm by the number
of  incorrect  predictions  made  during  a  learning  session  in  which  he  must  eventually  test  each 
patient  for  each  allergen. 

Up  to  now,  the  standard  on-line  model  (using  absolute  mistake  bounds  to  judge  the 
learners)  appears  to  be  the  appropriate  model.  We  now  discuss  the  selection  of  the  instances. 
Since the allergist has no control over the target relation (i.e., the allergies of his patients),
it  makes  sense  to  view  the  feedback  as  coming  from  an  adversary.  However,  do  we  really 
want  an  adversary  to  select  the  presentation  order  for  the  instances?  It  could  be  that  the 
allergist  is  working  for  a  cosmetic  company  and  due  to  restrictions  of  the  Food  and  Drug 
Administration  and  the  cosmetic  company  the  allergist  is  essentially  told  when  to  test  each 
person  for  each  allergen.  In  this  case,  it  is  appropriate  to  have  an  adversary  select  the 
presentation  order.  However,  in  the  typical  situation,  the  allergist  can  decide  in  what  order 
to  perform  the  testing  so  that  he  can  make  the  best  predictions  possible.  In  this  case,  we  want 
to  allow  the  learner  to  select  the  presentation  order.  One  could  also  imagine  the  situation 
where an intern is being guided by an experienced allergist, and thus a teacher helps to select
the  presentation  order.  Finally,  random  selection  of  the  presentation  order  may  provide  us 
with  a  better  feeling  for  the  behavior  of  an  algorithm.  Thus  the  learning  model  that  is  most 
appropriate  for  this  example  is  the  extended  mistake  bound  model. 

3.2  General  Mistake  Bounds 

In this section we begin our study of learning k-binary-relations by presenting general lower
and  upper  bounds  on  the  mistakes  made  by  the  learner  regardless  of  the  director. 

Throughout this section, we use the following notation. We say an entry $(i, j)$ of the
matrix $(M_{ij})$ is known if the learner was previously presented that entry. We assume without
loss of generality that the learner is never asked to predict the value of a known entry. We
say rows $i$ and $i'$ are consistent (given the current state of knowledge) if $M_{ij} = M_{i'j}$ for all
columns $j$ in which both entries $(i, j)$ and $(i', j)$ are known.
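The consistency test on partially known rows is simple to state in code. In the sketch below the learner's knowledge is stored as a dictionary from (row, column) to the revealed bit; this data layout and the function name are assumptions made for the example.

```python
def rows_consistent(known, i, i2, m):
    """Return True if rows i and i2 agree on every column where both entries are known.

    known maps (row, column) -> revealed bit; m is the number of columns.
    Rows that share no known column are vacuously consistent.
    """
    return all(
        known[(i, j)] == known[(i2, j)]
        for j in range(m)
        if (i, j) in known and (i2, j) in known
    )


# Hypothetical state of knowledge for a matrix with 2 columns.
known_entries = {(0, 0): 1, (1, 0): 1, (0, 1): 0, (2, 1): 1}
```

Note that consistency is relative to the current state of knowledge: two rows can be consistent now and be shown to differ once more entries are revealed.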

We  now  look  at  general  lower  and  upper  bounds  on  the  number  of  mistakes  that  apply  for 
all  directors.  Clearly,  any  learning  algorithm  makes  at  least  km  mistakes  for  some  matrix, 
regardless  of  the  query  sequence.  The  adversary  can  divide  the  rows  into  k  groups  and  reply 
that  the  prediction  was  incorrect  for  the  first  column  queried  for  each  entry  of  each  group. 
We  generalize  this  approach  to  force  mistakes  for  more  than  one  row  of  each  type. 

Theorem 1 For any $0 < \beta < 1$, any prediction algorithm makes at least $(1 - \beta)km + n\lfloor \lg(\beta k) \rfloor - (1 - \beta)k\lfloor \lg(\beta k) \rfloor$
mistakes regardless of the query sequence.

Proof: The adversary selects its feedback for the learner’s predictions as follows. For each
entry in the first $\lfloor \lg(\beta k) \rfloor$ columns the adversary replies that the learner’s response is incorrect.
At most $\beta k$ new row types are created by this action. Likewise, for each entry in the first
$(1 - \beta)k$ rows the adversary replies that the learner’s response is incorrect. This creates
at most $(1 - \beta)k$ new row types. The adversary makes all remaining entries in the matrix
zero. (See Figure 3.) The number of mistakes is the area of the unmarked region. Thus
the adversary has forced at least $(1 - \beta)km + n\lfloor \lg(\beta k) \rfloor - (1 - \beta)k\lfloor \lg(\beta k) \rfloor$ mistakes while
creating at most $\beta k + (1 - \beta)k = k$ row types. ■

By letting $\beta = 1/2$ we obtain the following corollary.

Corollary 1 Any algorithm makes at least $\frac{km}{2} + (n - \frac{k}{2})\lfloor \lg k - 1 \rfloor$ mistakes in the worst case
regardless of the query sequence.
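As a sanity check on the substitution, the bound of Theorem 1 can be evaluated numerically and compared with the closed form of Corollary 1. The function names below are illustrative, and the sketch assumes that βk is at least 1 so that the logarithm is non-negative.

```python
from math import floor, log2


def theorem1_lower_bound(beta, k, m, n):
    """(1 - beta) k m + n floor(lg(beta k)) - (1 - beta) k floor(lg(beta k))."""
    f = floor(log2(beta * k))  # number of columns on which every prediction is forced wrong
    return (1 - beta) * k * m + n * f - (1 - beta) * k * f


def corollary1_lower_bound(k, m, n):
    """km/2 + (n - k/2) floor(lg k - 1), i.e. Theorem 1 with beta = 1/2."""
    return k * m / 2 + (n - k / 2) * floor(log2(k) - 1)
```

For instance, with k = 8, m = 10 and n = 20, both expressions evaluate to 72, since $\lfloor \lg 4 \rfloor = \lfloor \lg 8 - 1 \rfloor = 2$.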

If computational efficiency is not a concern, for all query sequences the halving algorithm
[3, 15] provides a good mistake bound.


Observation  1  The  halving  algorithm  achieves  a  km  +  n  lg  k  mistake  bound. 


Figure 3: The final matrix created by the adversary in the proof of Theorem 1. (Diagram omitted;
its axis labels are $\lfloor \lg(\beta k) \rfloor$ and $m$ columns.) All entries
in the unmarked area will contain the bit not predicted by the learner. That is, a mistake is
forced on each entry in the unmarked area. All entries in the marked area will be zero.

Proof: We use a simple counting argument on the size of the concept class. There are $2^{km}$
ways to select the $k$ row types, and $k^n$ ways to assign one of the $k$ row types to each of the $n$
rows. Thus $|C| \le 2^{km} k^n$. Littlestone [15] proves that the halving algorithm makes at most
$\lg |C|$ mistakes, thus the number of mistakes made by the halving algorithm for this concept
class is at most $\lg(2^{km} k^n) = km + n \lg k$. ■
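For very small matrices the counting argument and the halving algorithm can be checked by brute force. The sketch below enumerates every matrix with at most k distinct rows and predicts by majority vote over the relations still consistent with the feedback. It takes time exponential in nm, so it illustrates the bound only and is not an efficient implementation; all function names are assumptions.

```python
from itertools import product
from math import log2


def k_binary_relations(n, m, k):
    """All n x m binary matrices (as tuples of row tuples) with at most k distinct rows."""
    rows = list(product((0, 1), repeat=m))
    return [M for M in product(rows, repeat=n) if len(set(M)) <= k]


def halving_mistakes(concepts, target, queries):
    """Run the halving algorithm on the given query sequence and count its mistakes."""
    version_space = list(concepts)  # relations consistent with all feedback so far
    mistakes = 0
    for i, j in queries:
        ones = sum(M[i][j] for M in version_space)
        guess = 1 if 2 * ones >= len(version_space) else 0  # majority vote
        truth = target[i][j]
        if guess != truth:
            mistakes += 1  # at least half of the version space is discarded below
        version_space = [M for M in version_space if M[i][j] == truth]
    return mistakes


def counting_bound(n, m, k):
    """The km + n lg k bound from the counting argument."""
    return k * m + n * log2(k)
```

With n = 5, m = 2 and k = 2 the class contains 184 relations, comfortably below $2^{km} k^n = 512$, and the halving algorithm makes at most $\lg 184$ mistakes on every target.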

In  the  remainder  of  this  section  we  study  efficient  prediction  algorithms  designed  to 
perform  well  against  each  of  the  directors.  For  some  directors  we  are  also  able  to  prove 
information-theoretic  lower  bounds  that  are  better  than  that  of  Theorem  1.  In  Section  3.3 
we consider the case that the query sequence is selected by the learner. We study the helpful-teacher
director in Section 3.4. Finally, in Section 3.5 we consider the case of an adversary
director. 

3.3  Self-Directed  Learning 

In  this  section,  we  present  an  efficient  algorithm  for  learning  the  matrix  for  the  case  in  which 
the  learner  is  the  director. 


Theorem  2  There  exists  a  polynomial  prediction  algorithm  that  achieves  a  km+(n—k)  fig  Jfc] 
mistake  bound  with  a  learner-selected  query  sequence. 

Proof: The query sequence selected simply specifies the entries of the matrix in row-major
order. The learner begins assuming there is only one row type. Let $\hat{k}$ denote the learner's
current estimate for k, so initially $\hat{k} = 1$. For the first row, the learner guesses each entry.
(This row becomes the template for the first row type.) Next the learner assumes that the
second row is the same as the first row. If he makes a mistake then the learner revises his
estimate $\hat{k}$ to be 2, guesses for the rest of the row, and uses that row as the template for
the second row type. In general, to predict $M_{ij}$, the learner predicts according to a majority
vote of the recorded row templates that are consistent with row i (breaking ties arbitrarily).
Thus, if a mistake is made, then at least half of the row types can be eliminated as the
potential type of row i. If more than $\lfloor \lg \hat{k} \rfloor$ mistakes are made in a row, then a new row type
has been found. In this case, the learner guesses for the rest of the row, makes this row the
template for row type $\hat{k} + 1$, and increments $\hat{k}$.

How many mistakes are made by this algorithm? Clearly, at most m mistakes are made
for the first row found of each of the k types. For each of the remaining $n - k$ rows, since
$\hat{k} \le k$, at most $\lfloor \lg k \rfloor$ mistakes are made. ■
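
The procedure in the proof of Theorem 2 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `self_directed_learn` is hypothetical, ties in the majority vote are broken toward 0, and a row that matches no recorded template is guessed as all zeros (the proof allows any guess in both situations).

```python
def self_directed_learn(M):
    """Query M in row-major order; return (mistakes, number of row types found)."""
    templates = []  # one fully known row per row type discovered so far
    mistakes = 0
    for row in M:
        # Templates that agree with every entry of this row seen so far.
        candidates = list(templates)
        for j, truth in enumerate(row):
            if candidates:
                ones = sum(t[j] for t in candidates)
                guess = 1 if 2 * ones > len(candidates) else 0  # majority, ties -> 0
            else:
                guess = 0  # arbitrary guess: the row is of a new type
            if guess != truth:
                mistakes += 1
            candidates = [t for t in candidates if t[j] == truth]
        if not candidates:
            templates.append(list(row))  # this row becomes a new template
    return mistakes, len(templates)
```

Each wrong majority vote removes at least half of the remaining candidate templates, which is where the $\lfloor \lg k \rfloor$ term per non-template row comes from.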

Note  that  this  algorithm  need  not  know  k  a  priori.  Furthermore,  it  obtains  the  same 
mistake  bound  even  if  an  adversary  tells  the  learner  which  row  to  examine,  and  in  what 
order  to  predict  the  columns,  provided  that  the  learner  sees  all  of  a  row  before  going  on  to 
the  next.  We  note  that  this  upper  bound  is  within  a  constant  factor  of  the  lower  bound  of 
Corollary  1.  However,  this  problem  becomes  harder  if  the  adversary  can  select  the  query 
sequence  without  restriction. 

3.4  Teacher-Directed  Learning 

In this section, we present upper and lower bounds on the number of mistakes made under
the helpful-teacher director. Recall that in this model, we consider the worst case mistake
bound over all consistent learners. Thus the question asked here is: what is the minimum
number of matrix entries a teacher must reveal so that there is a unique completion of the
matrix? That is, until there is a unique completion of the partial matrix, a mistake could be
made on the next prediction.

We now prove an upper bound on the number of entries needed to uniquely define the
target matrix.

Theorem 3 The number of mistakes made with a helpful teacher as the director is at most
$km + (n-k)(k-1)$.

Proof: First, the teacher presents the learner with one row of each type. For each of the
remaining $n - k$ rows the teacher presents an entry to distinguish the given row from each of
the $k - 1$ incorrect row types. After these $km + (n-k)(k-1)$ entries have been presented
we claim that there is a unique matrix with at most k row types that is consistent with
the partial matrix. Since all k distinct row types have been revealed in the first stage, all
remaining rows must be the same as one of the first k rows presented. However, each of the
remaining rows has been shown to be inconsistent with all but one of these k row templates.
■
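
The teacher's strategy in this proof can be illustrated with a small sketch. The helper name `teaching_set` is hypothetical, and the sketch assumes the teacher picks the first fully matching row as the representative of each type and the leftmost distinguishing column; the proof permits any choice.

```python
def teaching_set(M):
    """Return the set of (row, column) entries revealed by the helpful teacher."""
    m = len(M[0])
    reps = {}  # row type (as a tuple) -> index of its representative row
    for i, row in enumerate(M):
        reps.setdefault(tuple(row), i)

    revealed = set()
    # Stage 1: reveal one complete row of each type (k * m entries).
    for i in reps.values():
        revealed.update((i, j) for j in range(m))
    # Stage 2: for every other row, reveal one entry that rules out each wrong type.
    for i, row in enumerate(M):
        if reps[tuple(row)] == i:
            continue
        for other in reps:
            if other != tuple(row):
                j = next(c for c in range(m) if other[c] != row[c])
                revealed.add((i, j))
    return revealed
```

Because one column can rule out several wrong types at once, the returned set may be smaller than $km + (n-k)(k-1)$, but it is never larger.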

Is Theorem 3 the best possible such result? Clearly the teacher must present a row of
each type. But, in general, is it really necessary to present $k - 1$ entries of the remaining rows
to uniquely define the matrix? We now answer this question in the affirmative by presenting
a matching lower bound.

Theorem 4 The number of mistakes made with a helpful teacher as the director is at least
$\min\{nm,\ km + (n-k)(k-1)\}$.

Proof: The adversary selects the following matrix. The first row type consists of all zeros.
For $2 \le z \le \min\{m+1, k\}$, row type z contains $z - 2$ zeros, followed by a one, followed
by $m - z + 1$ zeros. The first k rows are each assigned to be a different one of the k row
types. Each remaining row is assigned to be the first row type. (See Figure 4.) Until there
is a unique completion of the partial matrix, by definition there exists a consistent learner
that could make a mistake. Clearly if the learner has not seen each column of each row type,
then the final matrix is not uniquely defined. This part of the argument accounts for km
mistakes. When $m + 1 \ge k$, for each of the remaining rows, unless all of the first $k - 1$
columns are known, there is some row type besides the first row type that must be consistent
with the given row. This argument accounts for $(n-k)(k-1)$ mistakes. Likewise, when
$m + 1 < k$, if any of the first m columns are not known then there is some row type besides
the first row type that must be consistent with the given row. This accounts for $(n-k)m$
mistakes. Thus the total number of mistakes is at least $\min\{nm,\ km + (n-k)(k-1)\}$. ■

                  0 0 0 0 0 0 0 0 0
                  1 0 0 0 0 0 0 0 0
    5 row types   0 1 0 0 0 0 0 0 0
                  0 0 1 0 0 0 0 0 0
                  0 0 0 1 0 0 0 0 0
                  0 0 0 0 0 0 0 0 0
                        ...
                  0 0 0 0 0 0 0 0 0


Figure  4:  The  matrix  created  by  the  adversary  against  the  helpful  teacher  director.  In  this 
example,  there  are  5  row  types  which  appear  in  the  first  five  rows  of  the  matrix. 
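
The adversary's matrix from the proof of Theorem 4 is easy to generate. The sketch below uses a hypothetical function name and assumes $k \le \min\{n, m+1\}$, so that all k row types described in the proof exist.

```python
def teacher_lower_bound_matrix(n, m, k):
    """Build the n x m matrix used against the helpful-teacher director."""
    assert 1 <= k <= min(n, m + 1), "sketch assumes k <= min(n, m + 1)"

    def row_type(z):
        # Type 1 is all zeros; type z >= 2 has a single one in column z - 1
        # (1-indexed), i.e. z - 2 zeros, a one, then m - z + 1 zeros.
        row = [0] * m
        if z >= 2:
            row[z - 2] = 1
        return row

    # The first k rows take the k distinct types; every other row is type 1.
    return [row_type(i + 1) if i < k else row_type(1) for i in range(n)]
```

For example, `teacher_lower_bound_matrix(7, 9, 5)` yields the layout shown in Figure 4, with the elided rows filled in as all-zero rows.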

We note that because the mistake bound under the helpful-teacher director must
apply to all consistent learners, it is possible to get mistake bounds that are not as good as
the bounds obtained when the learner is self-directed. Recall that in the previous section,
we proved a $km + (n-k)\lfloor \lg k \rfloor$ mistake bound under the learner director. This bound is
better than that obtained with a teacher because the learner uses a majority vote among the
known row types for making predictions. However, a consistent learner may use a minority
vote and could thus make $km + (n-k)(k-1)$ mistakes.

3.5  Adversary-Directed  Learning 

In  this  section  we  derive  upper  and  lower  bounds  on  the  number  of  mistakes  made  by  any 
learning  algorithm  for  an  adversary  as  the  director.  We  first  present  an  information-theoretic 
lower  bound  on  the  number  of  mistakes  an  adversary  can  force  the  learner  to  make.  Next,  we 
present an efficient prediction algorithm that achieves an optimal mistake bound if $k \le 2$.
Next  we  consider  the  related  problem  of  computing  the  minimum  number  of  row  types 
needed  to  complete  a  partially  known  matrix.  We  then  consider  learning  algorithms  that 
work  against  an  adversary  for  arbitrary  k.  We  present  an  upper  and  lower  bound  on  the 
number  of  mistakes  made  by  a  specific  type  of  algorithm. 

We now present an information-theoretic lower bound on the number of mistakes made
by any prediction algorithm when the adversary selects the query sequence. We obtain this
result by modifying the technique used in Theorem 1.


Theorem 5 Any prediction algorithm makes at least $\min\{nm,\ km + (n-k)\lfloor \lg k \rfloor\}$ mistakes
against an adversary-selected query sequence.

Proof: The adversary starts by presenting all entries in the first $\lfloor \lg k \rfloor$ columns (or m
columns if $m < \lfloor \lg k \rfloor$) and replying that each prediction is incorrect. If $m \ge \lfloor \lg k \rfloor$, this
step causes the learner to make $n \lfloor \lg k \rfloor$ mistakes. Otherwise, this step causes the learner
to make nm mistakes. Each row can now be classified as one of k row types. Next the
adversary presents the remaining columns for one row of each type, again replying that each
prediction is incorrect. For $m \ge \lfloor \lg k \rfloor$ this step causes the learner to make $k(m - \lfloor \lg k \rfloor)$
additional mistakes. For the remaining matrix entries, the adversary replies as dictated by
the completed row of the same row type as the given row. So the number of mistakes made
by the learner is at least
$\min\{nm,\ n\lfloor \lg k \rfloor + km - k\lfloor \lg k \rfloor\} = \min\{nm,\ km + (n-k)\lfloor \lg k \rfloor\}$. ■


3.5.1  Special  Case:  k  =  2 

We now consider efficient prediction algorithms for learning the matrix under an
adversary-selected query sequence. (Recall that if efficiency is not a concern the halving algorithm
makes at most $km + (n-k) \lg k$ mistakes.) In this section we consider the case that $k \le 2$,
and present an efficient prediction algorithm that performs optimally.

Theorem 6 There exists a polynomial prediction algorithm that makes at most $2m + n - 2$
mistakes against an adversary-selected query sequence for k = 2.

Proof: The algorithm uses a graph G whose vertices are the rows of the matrix and that
initially has no edges. To predict $M_{ij}$, the algorithm 2-colors the graph G, and then:

1. If no entry of column j is known, it guesses randomly.

2. Else if every known entry of column j is zero (respectively, one), it guesses zero (one).

3. Else it finds a row $i'$ of the same color as i and known in column j, and guesses $M_{i'j}$.

Finally, after the prediction is made and the feedback received, the graph G is updated by
adding an edge $ii'$ to G for each row $i'$ known in column j for which $M_{ij} \ne M_{i'j}$. Note that
one of the above cases always applies. Also, since k = 2, it will always be possible to find a
2-coloring.

Figure 5: The situation occurring if G does not have a unique coloring after $n - 1$ edges
have been added. The thick grey edges and the thick black edge show the cycle in G. Let e
(shown as a thick black edge) be the edge added to form the cycle.

How many mistakes can this algorithm make? It is not hard to see that cases 1 and 2
each occur only once for every column, so there are at most m mistakes made in each of
these cases. Furthermore, the first case 2 mistake adds at least one edge to G. When a
mistake is made in case 3, the algorithm learns that two rows are different, thus adding at
least one edge to G between two nodes currently of the same color. So after at most $n - 2$
case 3 mistakes, G has $n - 1$ edges.

We now prove that when G has $n - 1$ edges there is a unique 2-coloring of G (two colorings
that differ only by a renaming of the colors are considered to be the same). Suppose
there is not a unique 2-coloring. This assumption implies that there exists a set S of nodes,
where S contains at least one node and at most $n - 1$ nodes, such that the color of all nodes in
S can be reversed without causing two adjacent nodes to be the same color. Thus it follows
that G is not connected. However G has $n - 1$ edges and thus there exists some connected
component of G that must have a cycle. (See Figure 5.) We now separately consider the
cases that this cycle contains an odd number of edges or an even number of edges.

•  Case 1: Odd-length cycle. Let $e = v_1 v_2$ be the last edge placed in the cycle.
Consider the 2-coloring of G during the step of the algorithm when e was added. Since
$v_1$ and $v_2$ were connected by an even number of edges, in any legal 2-coloring they
must have been the same color. Thus we get a contradiction since a mistake could not
have occurred.

•  Case 2: Even-length cycle. Consider the last edge $e = v_1 v_2$ placed in the cycle.
Before e was added, since $v_1$ and $v_2$ were connected by an odd number of edges, in any
legal 2-coloring they must have been different colors. Since Step 3 of the algorithm
picks nodes of the same color, an edge could have never been placed between $v_1$ and
$v_2$. Thus we again have a contradiction.

In both cases we reach a contradiction, and thus we have shown that after G has $n - 1$ edges
there is a unique 2-coloring. So after at most $n - 2$ case 3 mistakes, there must be a unique
coloring of G and no more mistakes can occur. Thus, the worst case number of mistakes
made by this algorithm is $2m + n - 2$. ■
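
The three prediction cases and the graph update from the proof of Theorem 6 can be sketched as follows. The class and function names are hypothetical, the 2-coloring is computed by breadth-first search, and the "random" guess of case 1 is replaced by a fixed guess of 0 to keep the sketch deterministic.

```python
from collections import deque


class TwoTypeLearner:
    """Sketch of the k = 2 prediction rule; rows are the vertices of G."""

    def __init__(self, n, m):
        self.n, self.m = n, m
        self.known = {}                        # (row, col) -> revealed bit
        self.adj = [set() for _ in range(n)]   # edges join rows known to differ

    def _two_color(self):
        color = [None] * self.n
        for start in range(self.n):
            if color[start] is None:
                color[start] = 0
                queue = deque([start])
                while queue:
                    u = queue.popleft()
                    for v in self.adj[u]:
                        if color[v] is None:
                            color[v] = 1 - color[u]
                            queue.append(v)
        return color

    def predict(self, i, j):
        column = {r: b for (r, c), b in self.known.items() if c == j}
        if not column:
            return 0                           # case 1: arbitrary guess
        bits = set(column.values())
        if len(bits) == 1:
            return bits.pop()                  # case 2: column is constant so far
        color = self._two_color()
        for r, b in column.items():            # case 3: copy a same-colored row
            if color[r] == color[i]:
                return b
        raise AssertionError("unreachable when the target has at most 2 row types")

    def update(self, i, j, bit):
        for (r, c), b in self.known.items():
            if c == j and b != bit:
                self.adj[i].add(r)
                self.adj[r].add(i)
        self.known[(i, j)] = bit


def count_mistakes(M, order):
    """Run the learner on matrix M with the given query order."""
    learner = TwoTypeLearner(len(M), len(M[0]))
    mistakes = 0
    for i, j in order:
        if learner.predict(i, j) != M[i][j]:
            mistakes += 1
        learner.update(i, j, M[i][j])
    return mistakes
```

Case 3 always finds a row because rows with different known bits in column j are adjacent in G, so both colors appear among the rows known in that column.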

Note  that  for  k  =  2  this  upper  bound  matches  the  information-theoretic  lower  bound 
of  Theorem  5.  We  also  note  that  if  there  is  only  one  row  type  then  the  algorithm  given  in 
Theorem  6  makes  at  most  m  mistakes,  matching  the  information-theoretic  lower  bound. 

An interesting theoretical question is to find a linear mistake bound for constant $k \ge 3$
when provided with a k-colorability oracle. However, such an approach would have to
be greatly modified to yield a polynomial prediction algorithm since a polynomial-time
k-colorability oracle exists only if P = NP. Furthermore, even good polynomial time
approximations to a k-colorability oracle are not known [4, 14].

The remainder of this section focuses on designing polynomial prediction algorithms for
the case that the matrix has at least three row types. One approach that may seem promising
is to make predictions as follows. Compute a matrix that is consistent with all known entries
and that has the fewest possible row types. Then use this matrix to make the next prediction.
We now show that even computing the minimum number of row types needed to complete
a partially known matrix is NP-complete. Formally, we define the matrix k-complexity
problem as follows: given an $n \times m$ binary matrix M that is partially known, decide if there
is some matrix with at most k row types that is consistent with M. The matrix k-complexity
problem can be shown to be NP-complete by a reduction from graph k-colorability for the
cases where $k > 2$ and $m \ge n$.

Theorem 7 For $k > 2$ and $m \ge n$, the matrix k-complexity problem is NP-complete.

Proof: We use a reduction from graph k-colorability. Given an instance $G = (V, E)$ of
graph k-colorability we transform it into an instance of the matrix k-complexity problem.
Let $n = |V|$. For each edge $\{v_i, v_j\} \in E$, we add entries to the matrix so that row i and row
j cannot be the same row type. Specifically, for each vertex $v_i$, we set $M_{ii} = 0$, and $M_{ji} = 1$
for each neighbor $v_j$ of $v_i$. An example demonstrating this reduction is given in Figure 6.

Figure 6: An example of the reduction used in Theorem 7. The graph G is the instance for
the graph coloring problem. The partial matrix M is the instance for the matrix complexity
problem. We note that there exists a matrix that is a completion of M that uses only three
row types. The corresponding 3-coloring of G is demonstrated by the node colorings used in
G.

We now show that there is some matrix of at most k row types that is consistent with
this partial matrix if and only if G is k-colorable. We first argue that if there is a matrix $M'$
consistent with M that has at most k row types then G is k-colorable. By the construction
if two rows are of the same type there cannot be an edge between the corresponding nodes.
So just let the node color for each node be the type of the corresponding row in $M'$.

We now argue that if G is k-colorable, then there exists a matrix $M'$ consistent with M
that has at most k row types. By the construction of M, if a set of vertices are the same
color in G then the corresponding rows are consistent with each other. Thus there exists a
matrix with at most k row types that is consistent with M. ■

3.5.2  Row-Filter Algorithms
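
The reduction can be made concrete with a short sketch. The function names are hypothetical, vertices are numbered from 0, and `None` marks an unknown matrix entry.

```python
def graph_to_partial_matrix(n, edges):
    """Build the n x n partial matrix for a graph on vertices 0..n-1."""
    M = [[None] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = 0          # M_ii = 0 for every vertex v_i
    for i, j in edges:
        M[j][i] = 1          # v_j is a neighbor of v_i
        M[i][j] = 1          # v_i is a neighbor of v_j
    return M


def rows_consistent(a, b):
    """Two partial rows are consistent if they agree wherever both are known."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))
```

With this construction, adjacent vertices give inconsistent rows (they disagree in column i), while non-adjacent vertices give consistent rows, which is exactly the property the proof relies on.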

In this section we study the performance of a whole class of algorithms designed to learn a
matrix with arbitrary complexity k when an adversary selects the query sequence. We say
that an algorithm A is a row-filter algorithm if A makes its prediction for $M_{ij}$ strictly as a
function of j and all entries in the set I of rows consistent with row i and defined in column
j. That is, A's prediction is $f(I, j)$ where f is some (possibly probabilistic) function. So, to
make a prediction for $M_{ij}$, a row-filter algorithm considers all rows that could be the same
type as row i and whose value for column j is known, and uses these rows in any way one
could imagine to make a prediction. For example it could take a majority vote on the entries
in column j of all rows that are consistent with row i. Or, of the rows defined in column j,
it could select the row that has the most known values in common with row i and predict
according to its entry in column j. We have found that many of the prediction algorithms
we considered are row-filter algorithms.

Consider the simple row-filter algorithm, ConsMajorityPredict, in which $f(I, j)$ computes
the majority vote of the entries in column j of the rows in I. (Guess randomly in the case of
a tie.) Note that ConsMajorityPredict only takes time linear in the number of known entries
of the matrix to make a prediction. We now give an upper bound on the number of mistakes
made by ConsMajorityPredict.
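
A minimal sketch of ConsMajorityPredict follows. It assumes the partially known matrix is stored as a list of rows with `None` for unknown entries; the function names and the optional `rng` parameter are illustrative choices, not part of the original description.

```python
import random


def rows_consistent(a, b):
    """Two partial rows are consistent if they agree wherever both are known."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def cons_majority_predict(known, i, j, rng=random):
    """Predict entry (i, j) by majority vote over the filtered row set I."""
    # I = rows other than i that are defined in column j and consistent with row i.
    votes = [row[j] for r, row in enumerate(known)
             if r != i and row[j] is not None and rows_consistent(row, known[i])]
    ones = sum(votes)
    if 2 * ones > len(votes):
        return 1
    if 2 * ones < len(votes):
        return 0
    return rng.choice([0, 1])   # tie (including an empty I): guess randomly
```

This straightforward version rescans the whole matrix on every prediction; it is meant to show the rule itself rather than an optimized implementation.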

Theorem 8 The algorithm ConsMajorityPredict makes at most $km + n\sqrt{(k-1)m}$ mistakes
against an adversary-selected query sequence.

Proof: For all i, let d(i) be the number of rows consistent with row i. We define the potential
of a partially known matrix to be $\Phi = \sum_{i=1}^{n} d(i)$. We begin by considering how much the
potential function can change over the entire learning session.

Lemma 1 The potential function $\Phi$ decreases by at most $\frac{k-1}{k} n^2$ during the learning session.

Proof: Initially for all i, $d(i) = n$. So $\Phi_{init} = n^2$. Let $C(z)$ be the number of rows of type z
for $1 \le z \le k$. By definition, $\Phi_{final} = \sum_{z=1}^{k} C(z)^2$. Thus our goal is to minimize $\sum_{z=1}^{k} C(z)^2$
under the constraint that $\sum_{z=1}^{k} C(z) = n$. Using the method of Lagrange multipliers we
obtain that $\Phi_{final}$ is minimized when for all z, $C(z) = n/k$. Thus $\Phi_{final} \ge (n/k)^2 k = n^2/k$.
So $\Delta\Phi = \Phi_{init} - \Phi_{final} \le n^2 - \frac{n^2}{k} = \frac{k-1}{k} n^2$. ■

Now that the total decrease in $\Phi$ over the learning session is bounded, we need to determine
how many mistakes can be made without $\Phi$ decreasing by more than $\frac{k-1}{k} n^2$. We begin
by noting that $\Phi$ is non-increasing. Once two rows are found to be inconsistent, they
remain inconsistent. So for each i, d(i) is non-increasing, and thus $\Phi$ is non-increasing.
So to bound the number of mistakes made by ConsMajorityPredict we must compute
a lower bound on the amount $\Phi$ is decreased by each mistake. Intuitively, one expects $\Phi$ to
decrease by larger amounts as more of the matrix is seen. We formalize this intuition in the
next two lemmas.

Lemma 2 The rth mistake made when predicting an entry $M_{ij}$ in column j of some row i
of type z decreases $\Phi$ by at least $2(r-1)$.

Proof: Consider all the rows of type z. Since $r - 1$ mistakes have occurred in column j
of these rows, at least $r - 1$ entries are known in column j of rows of type z. Since these
rows must be in I, if a mistake occurs there must be at least $r - 1$ entries in I (and thus
consistent with row i) that differ in column j with row i. Thus if a mistake is made, row i is
found to be inconsistent with at least $r - 1$ rows it was thought to be consistent with. When
two previously consistent rows are found to be inconsistent, $\Phi$ decreases by two. Thus the
rth mistake made when predicting an entry in column j of
some row of type z decreases $\Phi$ by at least $2(r-1)$. ■

From Lemma 2, we see that the more entries that are known in a given column of a given
row type, the greater the decrease in $\Phi$ for future mistakes on such entries. So, intuitively
it appears that the adversary can maximize the number of mistakes made by the learner
by balancing the number of entries seen for each column of each row type. We prove that
this intuition is correct and apply it to obtain a lower bound on the amount $\Phi$ must have
decreased after the learner has made p mistakes.

Lemma 3 After p mistakes are made, the total decrease in $\Phi$ is at least $km\left(\frac{p}{km} - 1\right)^2$.

Proof: From Lemma 2, after the rth mistake in column j of row type z, the total decrease
in $\Phi$ from its initial value is at least $\sum_{x=1}^{r} 2(x-1) \ge (r-1)^2$. Let $W(j, z)$ be the number of
mistakes made in column j of rows of type z. The total decrease in $\Phi$ is at least

$$D = \sum_{j=1}^{m} \sum_{z=1}^{k} \left(W(j,z) - 1\right)^2$$

subject to the constraint $\sum_{j=1}^{m} \sum_{z=1}^{k} W(j,z) = p$.

Using the method of Lagrange multipliers, we obtain that D is minimized when $W(j,z) = \frac{p}{km}$
for all j and z. So the total decrease in $\Phi$ is at least

$$\sum_{j=1}^{m} \sum_{z=1}^{k} \left(\frac{p}{km} - 1\right)^2 = km\left(\frac{p}{km} - 1\right)^2. \qquad ■$$

We now complete the proof of the theorem. Combining Lemma 1 and Lemma 3 along
with the observation that $\Phi$ is non-increasing, we have shown that when

$$km\left(\frac{p}{km} - 1\right)^2 \ge \frac{k-1}{k} n^2$$

then $\Phi$ must have decreased as much as it can and thus no more mistakes will occur. Solving
for p we obtain that $\Phi$ has decreased by its maximum amount when $p = km + n\sqrt{(k-1)m}$. ■
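
As a check on the final algebra, here is a short worked derivation. It assumes the Lemma 3 lower bound has the form $km\left(\frac{p}{km} - 1\right)^2$ and the Lemma 1 upper bound is $\frac{k-1}{k} n^2$, which are the forms consistent with the stated solution for p.

```latex
\begin{align*}
km\left(\frac{p}{km} - 1\right)^2 &\ge \frac{k-1}{k}\,n^2 \\
\left(\frac{p}{km} - 1\right)^2 &\ge \frac{(k-1)\,n^2}{k^2 m} \\
\frac{p}{km} - 1 &\ge \frac{n\sqrt{k-1}}{k\sqrt{m}} \\
p &\ge km + n\sqrt{(k-1)m}
\end{align*}
```

The last step multiplies through by km and uses $m/\sqrt{m} = \sqrt{m}$.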


We note that by using the simpler argument that each mistake, except for the first mistake
in each column of each row type, decreases $\Phi$ by at least 2, we obtain a $km + \frac{k-1}{2k} n^2$ mistake
bound for any row-filter algorithm. Also, Manfred Warmuth [24] has independently given
an algorithm, based on the weighted majority algorithm of Littlestone and Warmuth [16],
that achieves an $O(km + n\sqrt{m \lg k})$ mistake bound. Warmuth's algorithm builds a complete
graph of n vertices where row i corresponds to vertex $v_i$ and all edges get an initial weight of
1. To predict a value for $(i, j)$ the learner takes a weighted majority of all active neighbors
of $v_i$ ($v_k$ is active if $(k, j)$ is known). After receiving feedback, the learner sets the weight on
the edge from $v_i$ to $v_k$ to be 0 if $(k, j) \ne (i, j)$. Finally, if a mistake occurs the learner doubles
the weight of $(v_i, v_k)$ if $(k, j) = (i, j)$ (i.e. the edges to neighbors that predicted correctly).
We note that this algorithm is not a row-filter algorithm.

Does ConsMajorityPredict give the best performance possible by a row-filter algorithm?
We now present an information-theoretic lower bound on the number of mistakes an adversary
can force against any row-filter algorithm.

Theorem 9 Let p be a prime and let $m = p^2 + p + 1$. Any row-filter algorithm for learning
a $2n \times m$ matrix with $m \ge n$ and $k \ge 2$ makes at least $n(p+1) = \Omega(n\sqrt{m})$ mistakes when
the adversary selects the query sequence.

Proof:  We  assume  that  the  adversary  knows  the  learner’s  algorithm  and  has  access  to  any 
random  bits  he  uses.  (One  can  prove  a  similar  lower  bound  on  the  expected  mistake  bound 
when  the  adversary  cannot  access  the  random  bits.) 


Figure  7:  A  projective  geometry  for  p  =  2,  m  =  7. 


Our proof depends upon the existence of a projective geometry $\Gamma$ on m points and lines [5].
That is, there exists a set of m points and a set of m lines such that each line contains exactly
$p + 1$ points and each point is at the intersection of exactly $p + 1$ lines. Furthermore, any pair
of lines intersects at exactly one point, and any two points define exactly one line. Figure 7
shows a matrix representation of such a geometry; an "x" in entry $i, j$ indicates that point
j is on line i. We use the first n lines of $\Gamma$.
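
For readers who want to experiment, one standard way to obtain such a geometry for a prime p is through homogeneous coordinates over the integers mod p. The sketch below is an illustration under that assumption, not necessarily the construction behind Figure 7; the function name is hypothetical.

```python
from itertools import product


def projective_plane(p):
    """Incidence matrix (lines x points) of the projective plane of prime order p."""
    # Each point is a nonzero triple mod p, normalized so that its first
    # nonzero coordinate is 1; lines use the same set of representatives.
    reps = [v for v in product(range(p), repeat=3)
            if any(v) and next(x for x in v if x) == 1]
    # Point x lies on line a exactly when a . x = 0 (mod p).
    return [[int(sum(a * b for a, b in zip(line, pt)) % p == 0) for pt in reps]
            for line in reps]
```

For p = 2 this gives a 7 x 7 incidence matrix with three points on every line, matching the parameters of Figure 7 up to a relabeling of points and lines.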

The matrix M consists of two row types: the odd rows are filled with ones and the
even rows with zeros. Imagine assigning two rows of M to each line of $\Gamma$. (See Figure 8.)
We now prove that the adversary can force a mistake for each entry of $\Gamma$. The adversary's
query sequence maintains the condition that an entry $i, j$ is not revealed unless line $\lceil i/2 \rceil$
of $\Gamma$ contains point j. In particular, the adversary will begin by presenting one entry of the
matrix for each entry of $\Gamma$. We prove that for each entry of $\Gamma$ the learner must predict the
same value for the two corresponding entries of the matrix. Thus the adversary forces a
mistake for the $n(p+1) = \Omega(n\sqrt{m})$ entries of $\Gamma$. The remaining entries of the matrix are
then presented in any order.

Let I be the set of rows that may be used by the row-filter algorithm when predicting
entry $(2i, j)$. Let $I'$ be the set of rows that may be used by the row-filter algorithm when
predicting entry $(2i-1, j)$. We prove by contradiction that $I = I'$. If $I \ne I'$ then it must
be the case that there is some row r that is defined in column j and consistent with row $2i$,
yet inconsistent with row $2i - 1$. By definition of the adversary's query sequence it must be
the case that lines $\lceil r/2 \rceil$ and $\lceil (2i-1)/2 \rceil$ of $\Gamma$ contain point j. Furthermore, since $(2i-1, j)$
is being queried that entry is not known. Thus rows r and $2i - 1$ must both be known in
some other column $j'$ since they are known to be inconsistent. Thus since only entries in $\Gamma$
are shown, it follows that lines $\lceil r/2 \rceil$ and $\lceil (2i-1)/2 \rceil$ of $\Gamma$ also contain point $j'$ for $j' \ne j$.
So, this implies that lines $\lceil r/2 \rceil$ and $\lceil (2i-1)/2 \rceil$ of $\Gamma$ must intersect at two points giving a
contradiction. Thus $I = I'$ and so $f(I, j) = f(I', j)$ for entry $(2i, j)$ and entry $(2i-1, j)$.
Since the rows differ in each column the adversary can force a mistake on one of these entries.
Since the adversary has access to the random bits of the learner, he can compute $f(I, j)$ just
before making his query, and ask the learner to predict the entry for which the mistake will
be made. This procedure is repeated for the pair of entries corresponding to each element
of $\Gamma$. ■

Figure 8: The matrix created by the adversary in the proof of Theorem 9. The shaded
regions correspond to the entries in the first n lines of $\Gamma$. The learner is forced to make a
mistake on one of the entries in each shaded rectangle. (The figure labels one such pair of
entries, $(2i-1, j)$ and $(2i, j)$.)

We use a similar argument to get an $\Omega(m\sqrt{n})$ bound for $m < n$. Combined with the lower
bound of Theorem 5 and Theorem 9 we obtain a $\Omega(km + (n-k)\lfloor \lg k \rfloor + \min\{n\sqrt{m},\ m\sqrt{n}\})$
lower bound on the number of mistakes made by a row-filter algorithm.

Corollary 2 Any row-filter algorithm makes $\Omega(km + (n-k)\lfloor \lg k \rfloor + \min\{n\sqrt{m},\ m\sqrt{n}\})$
mistakes against an adversary-selected query sequence.

Comparing this lower bound to the upper bound proven for ConsMajorityPredict, we see
that for fixed k the mistake bound of ConsMajorityPredict is within a constant factor of
optimal.

Given this lower bound, one may question the $2m + n - 2$ upper bound for k = 2 given in
Theorem  6.  However,  the  algorithm  described  is  not  a  row-filter  algorithm.  Also  compared 
to  our  results  for  the  learner-selected  query  sequence,  it  appears  that  allowing  the  learner  to 
select  the  query  sequence  is  quite  helpful. 

3.6  Randomly-Directed  Learning 

In  this  section  we  consider  the  case  that  the  learner  is  presented  at  each  step  with  one  of 
the  remaining  entries  of  the  matrix  selected  uniformly  and  independently  at  random.  We 
present a prediction algorithm that makes $O(km + nk\sqrt{H})$ mistakes on average where H is
the  maximum  Hamming  distance  between  any  two  rows  of  the  matrix.  We  note  that  when 
$H = \Omega(m/k)$ the result of Theorem 8 supersedes this result. A key result of this section is a
proof  relating  two  different  probabilistic  models  for  analyzing  the  mistake  bounds  under  a 
random  presentation.  We  first  consider  a  simple  probabilistic  model  in  which  the  requirement 
that t matrix entries are known is simulated by assuming that each entry of the matrix is seen
independently with probability $t/(nm)$. We then prove that any upper bound obtained on the
number of mistakes under this simple probabilistic model holds under the true model (to
within a constant factor) in which we have exactly t entries known. This result is extremely
useful since in the true model the dependencies among the probabilities that matrix entries
are known make the proof significantly more difficult.

We define the algorithm RandomConsistentPredict to be the row-filter algorithm where
the learner makes his prediction for $M_{ij}$ by choosing one row $i'$ of I uniformly at random and
predicting the value of $M_{i'j}$. (If I is empty then RandomConsistentPredict makes a random
guess.)
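
A minimal sketch of RandomConsistentPredict, assuming the partially known matrix is stored as a list of rows with `None` for unknown entries. The function names and the optional `rng` parameter are illustrative.

```python
import random


def rows_consistent(a, b):
    """Two partial rows are consistent if they agree wherever both are known."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def random_consistent_predict(known, i, j, rng=random):
    """Predict entry (i, j) by copying a uniformly random row of the set I."""
    # I = rows other than i that are defined in column j and consistent with row i.
    candidates = [r for r, row in enumerate(known)
                  if r != i and row[j] is not None and rows_consistent(row, known[i])]
    if not candidates:
        return rng.choice([0, 1])       # I is empty: make a random guess
    return known[rng.choice(candidates)][j]
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible when experimenting with random presentation orders.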

Theorem 10 Let H be the maximum Hamming distance between any two rows of M. Then
the expected number of mistakes made by RandomConsistentPredict is $O(k(n\sqrt{H} + m))$.

Proof: Let $U_t$ be the probability that the prediction rule makes a mistake on the $(t+1)$st
step. That is, $U_t$ is the chance that a prediction error occurs on the next randomly selected
entry given that exactly t other randomly chosen entries are already known. Clearly, the
expected number of mistakes is $\sum_{t=0}^{S-1} U_t$, where $S = nm$. Our goal is to find an upper bound
for this sum.

The condition that exactly t entries are known makes the computation of $U_t$ rather messy
since the probability of having seen some entry of the matrix is not independent of knowing
the others. Instead, we compute the probability $V_t$ of a mistake under the simpler assumption
that each entry of the matrix has been seen with probability $t/S$, independent of the rest of
the matrix. We first compute an upper bound for the sum $\sum_{t=0}^{S-1} V_t$, and then show that this
sum is within a constant factor of $\sum_{t=0}^{S-1} U_t$.

Lemma 4 $\sum_{t=0}^{S-1} V_t = O(km + nk\sqrt{H})$.

Proof: Fix $t$, and let $p = t/S$. For any row $i$ we define $d(i)$ to be the number of rows of
the same type as row $i$. We first prove that, without loss of generality, we can assume that
$d(i) > 1$ for all $1 \le i \le n$. Namely, if this lemma holds for this restricted case, then it holds
in the general case. Suppose there are $n'$ rows that are each different from all other rows.
Applying this lemma to the relation obtained by removing the $n'$ objects that have distinct
row types and then adding in the mistakes for these removed objects we get:

\[
\begin{aligned}
\sum_{t=0}^{S-1} V_t &= O\left((k - n')m + n'(k - n')\sqrt{H}\right) + n'm \\
&= O\left(km + n'(k - n')\sqrt{H}\right) \\
&= O\left(km + nk\sqrt{H}\right).
\end{aligned}
\]

Thus for the remainder of this proof we may assume that $d(i) > 1$ for all $i$.

We compute $V_t$ as

\[
V_t = \frac{1}{nm} \sum_{i,j} \Pr[\text{a mistake is made} \mid M_{ij} \text{ is unknown and } (i,j) \text{ is presented next}]. \tag{1}
\]

If some of the rows known in column $j$ are consistent with row $i$, then the probability of a
mistake in row $i$, column $j$ is the chance that one of these rows $i'$ (chosen at random) differs
from row $i$ in the $j$th column, or

\[
\frac{\Pr[M_{ij} \ne M_{i'j} \wedge \text{rows } i, i' \text{ are consistent}]}{\Pr[\text{rows } i, i' \text{ are consistent}]}. \tag{2}
\]

(For brevity, we omit the implicit conditions that entry $i,j$ is unknown and $i',j$ is known.)

We first focus on the numerator of Expression 2. We define $r_0$ as the expected number
of rows of the same type as row $i$ and defined in column $j$. Since in the $d(i) - 1$ rows of the
same type as row $i$, column $j$ is known with probability $p$,

\[
r_0 = (d(i) - 1)p.
\]

Next we define $r_1$ to be the expected number of rows of a different type from row $i$ that
are consistent with row $i$ and defined in column $j$. Consider a column, $c$, in which some row
$i'$ differs from row $i$. With probability $1 - p^2$ either $i$ or $i'$ is not seen in column $c$ and thus
rows $i$ and $i'$ are not detected to be inconsistent in column $c$. We know that row $i$ is not
known in column $j$ and thus the number of columns for which we may detect that $i$ and $i'$
are different depends on whether $i$ and $i'$ differ in column $j$. Let $h(i,i')$ be the Hamming
distance between rows $i$ and $i'$. Let $h'(i,i')$ be the number of columns in which $i$ and $i'$ differ
excluding column $j$. Since column $j$ of row $i'$ is seen with probability $p$,

\[
r_1 = \sum_{i' : h(i,i') > 0} p\,(1 - p^2)^{h'(i,i')}.
\]

Finally, since $(1 - p^2) \le 1$ and $h'(i,i') \ge h(i,i') - 1$,

\[
r_1 \le \sum_{i' : h(i,i') > 0} p\,(1 - p^2)^{h(i,i') - 1}.
\]

Finally, we define $r_2$ to be the expected number of rows that are consistent with row $i$
and defined in column $j$, but different from row $i$ in column $j$. Only the rows meeting the
conditions for $r_1$ are considered in the set $I$, and with probability $h(i,i')/m$ the row $i'$ differs
from row $i$ in column $j$. Thus

\[
r_2 \le \sum_{i' : h(i,i') > 0} \frac{h(i,i')}{m}\, p\,(1 - p^2)^{h(i,i') - 1}.
\]

We can now evaluate the numerator of Expression (2). Namely,

\[
\sum_{i' \ne i} \Pr[M_{ij} \ne M_{i'j} \wedge \text{rows } i, i' \text{ are consistent}] = \frac{r_2}{r_0 + r_1} \le \frac{r_2}{r_0}.
\]

We now compute a lower bound on the denominator of Expression (2) so that we can get
an upper bound on the entire expression. Since two rows of the same row type are always
consistent, it follows that

\[
\sum_{i' \ne i} \Pr[\text{rows } i, i' \text{ are consistent}] \ge d(i) - 1.
\]

Recall Expression (2) for the probability of making a mistake when predicting $M_{ij}$ given
that some row consistent with row $i$ was defined in column $j$. An error can also occur when
there are no consistent rows known in column $j$. This situation occurs with probability at
most $(1 - p)^{d(i)-1}$ since the $j$th column of any row is not seen with probability $(1 - p)$ and
this column must not be seen in each of the $d(i) - 1$ other rows of the same type as row $i$.
Combining these facts, we have

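\[
\begin{aligned}
V_t &\le \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( (1-p)^{d(i)-1} + \sum_{i' : h(i,i') > 0} \frac{h(i,i')\, p\,(1-p^2)^{h(i,i')-1}}{m\,p\,(d(i)-1)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} (1-p)^{d(i)-1} + \frac{1}{nm} \sum_{i=1}^{n} \sum_{i' : h(i,i') > 0} \frac{h(i,i')\,(1-p^2)^{h(i,i')-1}}{d(i)-1}.
\end{aligned}
\]

Recall that our goal is to upper bound the sum $\sum_{t=0}^{S-1} V_t$. Applying the above upper bound
for $V_t$ we get

\[
\sum_{t=0}^{S-1} V_t \le \sum_{t=0}^{S-1} \left( \frac{1}{n} \sum_{i=1}^{n} \left(1 - \frac{t}{S}\right)^{d(i)-1} + \frac{1}{nm} \sum_{i=1}^{n} \sum_{i' : h(i,i') > 0} \frac{h(i,i')}{d(i)-1} \left(1 - \left(\frac{t}{S}\right)^2\right)^{h(i,i')-1} \right). \tag{3}
\]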

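We now evaluate the first part of the above expression. We begin by noting that

\[
\sum_{t=0}^{S-1} \left(1 - \frac{t}{S}\right)^{d(i)-1} \le 1 + \int_0^S \left(1 - \frac{t}{S}\right)^{d(i)-1} dt.
\]

Since

\[
\int_0^S \left(1 - \frac{t}{S}\right)^{d(i)-1} dt = \int_0^S \left(\frac{t}{S}\right)^{d(i)-1} dt = \frac{S}{d(i)},
\]

we get that

\[
\frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{S-1} \left(1 - \frac{t}{S}\right)^{d(i)-1} \le \frac{1}{n} \sum_{i=1}^{n} \left(1 + \frac{S}{d(i)}\right) = 1 + m \sum_{i=1}^{n} \frac{1}{d(i)} = km + 1. \tag{4}
\]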


We  obtain  the  last  step  of  the  equality  by  rewriting  the  summation  to  go  over  all  the  row 
types:  there  are  n(r)  terms  for  rows  of  type  r  and  thus  each  row  type  contributes  1  to  the 
summation. 

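We are now ready to evaluate the second part of Expression 3. To complete the proof of
the theorem we must show that

\[
\sum_{t=0}^{S-1} \frac{1}{nm} \sum_{i=1}^{n} \sum_{i' : h(i,i') > 0} \frac{h(i,i')}{d(i)-1} \left(1 - \left(\frac{t}{S}\right)^2\right)^{h(i,i')-1} = O(nk\sqrt{H}).
\]

First we consider the case in which $h(i,i') = 1$. In this case the second part of Expression 3
simplifies to:

\[
\sum_{t=0}^{S-1} \frac{1}{nm} \sum_{i} \sum_{i'} \frac{1}{d(i)-1} = \sum_{i} \sum_{i'} \frac{1}{d(i)-1} = O(nk). \tag{5}
\]

Next we consider when $h(i,i') > 1$. We begin by deriving an upper bound for the second
part of Expression 3 of

\[
\frac{1}{nm} \sum_{i} \sum_{i'} \frac{h(i,i')}{d(i)-1} \left( 1 + \int_0^S \left(1 - \left(\frac{t}{S}\right)^2\right)^{h(i,i')-1} dt \right).
\]

By analyzing the integral and applying the inequality $e^x \ge 1 + x$ we get

\[
\int_0^S \left(1 - \left(\frac{t}{S}\right)^2\right)^{h(i,i')-1} dt \le \int_0^S \exp\left\{ -\left(\frac{t}{S}\right)^2 (h(i,i') - 1) \right\} dt. \tag{6}
\]

A standard integral table [9] gives

\[
\int_0^\infty \exp\{-a^2 x^2\}\, dx = \frac{1}{2a}\sqrt{\pi}.
\]

So letting $a = \sqrt{h(i,i') - 1}/S$ and $x = t$ one gets that

\[
\int_0^\infty \exp\left\{ -\left(\frac{t}{S}\right)^2 (h(i,i') - 1) \right\} dt = \frac{S\sqrt{\pi}}{2\sqrt{h(i,i') - 1}}. \tag{7}
\]

By combining equations (6) and (7) we get that

\[
\int_0^S \left(1 - \left(\frac{t}{S}\right)^2\right)^{h(i,i')-1} dt \le \frac{S\sqrt{\pi}}{2\sqrt{h(i,i') - 1}}. \tag{8}
\]

Therefore, the second part of Expression 3 can be upper bounded by

\[
\begin{aligned}
\frac{1}{nm} \sum_{i} \sum_{i'} \frac{h(i,i')}{d(i)-1} \left( 1 + \frac{S\sqrt{\pi}}{2\sqrt{h(i,i') - 1}} \right)
&= \frac{1}{nm} \sum_{i} \sum_{i'} \frac{h(i,i')}{d(i)-1} + \frac{\sqrt{\pi}}{2} \sum_{i} \sum_{i'} \frac{h(i,i')}{(d(i)-1)\sqrt{h(i,i') - 1}} \\
&= O\left(nk + nk\sqrt{H}\right) = O(nk\sqrt{H}).
\end{aligned} \tag{9}
\]

Finally, putting together Equations 4, 5 and 9 we obtain an $O(km + nk\sqrt{H})$ bound as desired.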


To complete the theorem, we prove the main result of this section, namely, the upper
bound obtained under this simple probabilistic model holds (to within a constant factor)
for the true model. In other words, to compute an upper bound on the number of mistakes
made by a prediction algorithm when the instances are selected according to a uniform
distribution on the instance space, one can replace the requirement that exactly $t$ matrix
entries are known by the requirement that each matrix entry is known with probability $t/S$.

Lemma 5 $\sum_{t=0}^{S-1} U_t = O\left(\sum_{t=0}^{S-1} V_t\right)$.
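Proof: We first note that

\[
V_t = \sum_{r=0}^{S-1} \binom{S}{r} \left(\frac{t}{S}\right)^r \left(1 - \frac{t}{S}\right)^{S-r} U_r.
\]

To see this, observe that for each $r$, where $r$ is the number of known entries, we need just
multiply $U_r$ by the probability that exactly $r$ entries are known when each entry is known
with probability $t/S$. Therefore,

\[
\sum_{t=0}^{S-1} V_t = \sum_{t=0}^{S-1} \sum_{r=0}^{S-1} \binom{S}{r} \left(\frac{t}{S}\right)^r \left(1 - \frac{t}{S}\right)^{S-r} U_r
= \sum_{r=0}^{S-1} U_r \left( \sum_{t=0}^{S-1} \binom{S}{r} \left(\frac{t}{S}\right)^r \left(1 - \frac{t}{S}\right)^{S-r} \right). \tag{10}
\]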

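Thus, to prove the lemma, it suffices to show that the inner summation is bounded below
by a positive constant. By symmetry, assume that $r \le S/2$ and let $y = S - r$. By applying
Stirling's approximation we obtain that

\[
\binom{S}{r} = \Theta\left( \frac{S^S}{r^r y^y} \sqrt{\frac{S}{ry}} \right).
\]

Applying this formula to the desired summation we obtain that

\[
\begin{aligned}
\sum_{t=0}^{S-1} \binom{S}{r} \left(\frac{t}{S}\right)^r \left(1 - \frac{t}{S}\right)^{S-r}
&= \Theta\left( \sqrt{\frac{S}{ry}} \sum_{t=0}^{S-1} \left(\frac{t}{r}\right)^r \left(\frac{S-t}{y}\right)^y \right) \\
&= \Omega\left( \sqrt{\frac{S}{ry}} \sum_{x=1}^{\sqrt{ry/S}} \left(1 + \frac{x}{r}\right)^r \left(1 - \frac{x}{y}\right)^y \right).
\end{aligned}
\]

The last step above follows by letting $x = t - r$ and reducing the limits of the summation. To
complete the proof that the inner summation of Equation 10 is bounded below by a positive
constant we need just prove that

\[
\left(1 + \frac{x}{r}\right)^r \left(1 - \frac{x}{y}\right)^y \ge e^{-2}
\]

for all $1 \le x \le \sqrt{ry/S}$.

Using the inequality $1 + x \le e^x$, it can be shown that for $1 + z > 0$, $1 + z \ge e^{z/(1+z)}$. We
apply this observation to get that

\[
\begin{aligned}
\left(1 + \frac{x}{r}\right)^r \left(1 - \frac{x}{y}\right)^y
&\ge \exp\left\{ \frac{x}{1 + x/r} - \frac{x}{1 - x/y} \right\}
= \exp\left\{ \frac{rx}{r + x} - \frac{yx}{y - x} \right\} \\
&= \exp\left\{ \frac{-x^2 (r + y)}{(r + x)(y - x)} \right\}
= \exp\left\{ \frac{-S x^2}{(r + x)(y - x)} \right\}.
\end{aligned}
\]

Since $x \le \sqrt{ry/S}$, it follows that $Sx^2 \le ry$. Applying this observation to the above inequality
we get that

\[
\left(1 + \frac{x}{r}\right)^r \left(1 - \frac{x}{y}\right)^y
\ge \exp\left\{ \frac{-ry}{(r + x)(y - x)} \right\}
= \exp\left\{ \frac{-ry}{ry + (y - r)x - x^2} \right\}
\ge \exp\left\{ \frac{-ry}{ry - ry/S} \right\}
= \exp\left\{ \frac{-1}{1 - 1/S} \right\}.
\]

Finally we note that for $S \ge 2$, $e^{-1/(1 - 1/S)} \ge e^{-2}$. This completes the proof of the lemma. ■

Combining Lemma 4 and Lemma 5 we obtain that $\sum_{t=0}^{S-1} U_t = O(km + nk\sqrt{H})$, giving
the desired result. ■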

This completes our discussion of learning a $k$-binary-relation.


4  Learning  a  Total  Order 

In  this  section  we  present  our  results  for  learning  a  binary  relation  on  a  set  where  it  is  known 
a  priori  that  the  relation  forms  a  total  order.  One  can  view  this  problem  as  that  of  learning 
a  total  order  on  a  set  of  n  objects  where  an  instance  corresponds  to  comparing  which  of  two 
objects  is  greater  in  the  target  total  order.  Thus  this  problem  is  like  comparison-based  sorting 
except  for  two  key  differences:  we  vary  the  agent  selecting  the  order  in  which  comparisons 
are  made  (in  sorting  the  learner  does  the  selection)  and  we  charge  the  learner  only  for  all 
incorrectly  predicted  comparisons. 

Before  describing  our  results,  we  motivate  this  section  with  the  following  example.  There 
are  n  basketball  teams  that  are  competing  in  a  round-robin  tournament.  That  is,  each  team 
will  play  all  other  teams  exactly  once.  Furthermore,  we  make  the  (admittedly  simplistic) 
assumption  that  there  is  a  ranking  of  the  teams  such  that  a  team  wins  its  match  if  and  only  if 
its  opponent  is  ranked  below  it.  The  learner  wants  to  place  a  $10  bet  on  each  game:  if  it  bets 
on  the  winning  team  it  wins  $10  and  if  it  bets  on  the  losing  team  it  loses  $10.  Of  course,  the 
goal of the learner is to win as many bets as possible. We formalize the problem of learning
a total order as follows. The instance space is $X = \bigcup_{n \ge 1} X_n$, where $X_n = \{1,\ldots,n\} \times \{1,\ldots,n\}$. An
instance $(i,j)$ in $X_n$ is in the target concept if and only if object $i$ precedes object $j$ in the
corresponding total order.

In the next section we discuss the relation between the halving algorithm and approximate
counting. Then we show how to use an approximation scheme to implement a randomized
version of the approximate halving algorithm, and apply this result to the problem of learning
a total order on a set of $n$ elements. Finally, we discuss how a majority algorithm can be
used to implement a counting algorithm.

4.1  The  Halving  Algorithm  and  Approximate  Counting 

In  this  section  we  review  the  halving  algorithm  and  approximate  counting  schemes.  We  first 
cover  the  halving  algorithm  [3,  15].  Let  V  denote  the  set  of  concepts  in  Cn  that  are  consistent 
with  the  feedback  from  all  previous  queries.  Given  an  instance  x  in  Xn,  for  each  concept  in  V 
the  halving  algorithm  computes  the  prediction  of  that  concept  for  x  and  predicts  according 
to  the  majority.  Finally,  all  concepts  in  V  that  are  inconsistent  with  the  correct  prediction 
are deleted. Littlestone [15] shows that this algorithm makes at most $\lg|C_n|$ mistakes. Now
suppose the prediction algorithm predicts according to the majority of concepts in set $V'$, the
set of all concepts in $C_n$ consistent with all incorrectly predicted instances. Littlestone [15]
also proves that this space-efficient halving algorithm makes at most $\lg|C_n|$ mistakes. So
for any prediction algorithm $A$ that only remembers its mistakes, the number of instances
stored by $A$ is bounded by $MB(A, C_n)$.

We define an approximate halving algorithm to be the following generalization of the
halving algorithm. Given instance $x$ in $X_n$, an approximate halving algorithm predicts in
agreement with at least $\varphi|V|$ of the concepts in $V$ for some constant $0 < \varphi \le 1/2$.

Theorem 11 An approximate halving algorithm makes at most $\log_{(1-\varphi)^{-1}} |C_n|$ mistakes for
concept class $C_n$.

Proof: Each time a mistake is made, the number of concepts that remain in $V$ is reduced
to at most a $(1 - \varphi)$ fraction of its previous value. Thus after at most $\log_{(1-\varphi)^{-1}} |C_n|$ mistakes there is one consistent
concept left in $C_n$. ■

We  note  that  the  above  result  holds  for  the  space-efficient  version  of  the  approximate 
halving algorithm.

We now introduce the notion of an approximate counting scheme for counting the number
of elements in a finite set $S$. Let $x$ be a description of a set $S_x$ in some natural encoding.
An exact counting scheme on input $x$ outputs $|S_x|$ with probability 1. Such a scheme is
polynomial if it runs in time polynomial in $|x|$. Sometimes exact counting can be done in
polynomial time; however, the counting problem is often #P-complete and thus assumed to
be intractable. (For a discussion of the class #P see Valiant [22].) For many #P-complete
problems good approximations are possible [12, 20, 21]. A randomized approximation scheme,
$R$, for a counting problem satisfies the following condition for all $\epsilon, \delta > 0$:

\[
\Pr\left[ \frac{|S_x|}{1 + \epsilon} \le R(x, \epsilon, \delta) \le |S_x|(1 + \epsilon) \right] \ge 1 - \delta
\]

where $R(x, \epsilon, \delta)$ is $R$'s estimate on input $x$, $\epsilon$, and $\delta$. Such a scheme is fully polynomial if it
runs in time polynomial in $|x|$, $1/\epsilon$, and $\lg(1/\delta)$. For a further discussion see Sinclair [20].

We now review work on counting the number of linear extensions of a partial order. That
is, given a partial order on a set of $n$ elements, the goal is to compute the number of total
orders that are linear extensions of the given partial order. We discuss the relationship
between this problem and that of computing the volume of a convex polyhedron. (For more
details on this subject, see Section 2.4 of Lovasz [17].) Given a convex set $S$ and an element
$a$ of $\mathbb{R}^n$, a weak separation oracle

1. Asserts that $a \in S$, or

2. Asserts that $a \notin S$ and supplies a reason why. In particular for closed convex sets in
$\mathbb{R}^n$, if $a \notin S$ then there exists a hyperplane separating $a$ from $S$. So if $a \notin S$, the oracle
responds with such a separating hyperplane as the reason why $a \notin S$.

We now discuss how to reduce the problem of counting the number of extensions of a partial
order on $n$ elements to that of computing the volume of a convex $n$-dimensional polyhedron
given by a separation oracle. The polyhedron built in the reduction will be a subset of
the unit hypercube in $\mathbb{R}^n$, where the polyhedron is determined by the intersection of the
halfspaces given by the inequalities of the partial order. (See Figure 9 for an example with
$n = 3$.) For a total order the polyhedron is a simplex such that for any pair of total orders
the simplices only intersect in a face (zero volume). Let $T_n$ be the set of all $n!$ total orders
on $n$ elements. Then

\[
\text{unit hypercube in } \mathbb{R}^n = \bigcup_{t \in T_n} \text{polyhedron defined by } t. \tag{11}
\]

Figure 9: The polyhedron formed by the total order $z > y > x$.

Let $P$ be a partial order on a set of $n$ elements. From Equation 11 and the observation that
the volumes of the polyhedra formed by each total order are equal, it follows that the volume
of the polyhedron defined by any total order is $1/n!$. Thus it follows that for any partial
order $P$

\[
\frac{\text{number of extensions of } P}{n!} = \text{volume of polyhedron defined by } P. \tag{12}
\]

Rewriting equation (12), we obtain that

\[
\text{number of extensions of } P = n! \cdot (\text{volume of polyhedron defined by } P). \tag{13}
\]

Finally, we note that the weak separation oracle is easy to implement for any partial
order. Given inputs $a$ and $S$, it just checks each inequality of the partial order to see
if $a$ is in the convex polyhedron $S$. If $a$ does not satisfy some inequality then reply that
$a \notin S$ and return that inequality as the separating hyperplane. Otherwise, if $a$ satisfies all
inequalities, reply that $a \in S$.
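
The oracle just described can be sketched in a few lines of Python. This is an illustration only: it assumes the partial order is given as a list of index pairs `(u, v)`, each meaning that coordinate `u` must not exceed coordinate `v`, and the name `separation_oracle` is hypothetical.

```python
def separation_oracle(a, partial_order):
    """Weak separation oracle for the polyhedron of a partial order.

    a             -- a point, given as a sequence of coordinates
    partial_order -- list of pairs (u, v), each encoding the inequality a[u] <= a[v]
                     (an assumed encoding, chosen for this sketch)

    Returns (True, None) if a satisfies every inequality, otherwise
    (False, (u, v)) where (u, v) is a violated inequality; that inequality
    defines a hyperplane separating a from the polyhedron.
    """
    for (u, v) in partial_order:
        if a[u] > a[v]:
            return False, (u, v)
    return True, None
```

For the chain of inequalities `[(0, 1), (1, 2)]`, the point `[0.1, 0.5, 0.9]` is accepted, while `[0.6, 0.5, 0.9]` is rejected with `(0, 1)` reported as the violated inequality.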

Dyer, Frieze and Kannan [7] give a fully-polynomial randomized approximation scheme
(fpras) to approximate the volume of a polyhedron given a separation oracle. From Equation 13
we see that this fpras for estimating the volume of a polyhedron can be easily applied
to estimate the number of extensions of a partial order. Furthermore, Dyer and Frieze [8]
prove that it is #P-hard to exactly compute the volume of a polyhedron given either by a list
of its facets or its vertices.

Independently, Matthews [18] has described an algorithm to generate a random linear
extension of a partial order. Consider the convex polyhedron $K$ defined by the partial
order. Matthews's main result is a technique to sample nearly uniformly from $K$. Given
such a procedure to sample uniformly from $K$, one can sample uniformly from the set of
extensions of a partial order by choosing a random point in $K$ and then selecting the total
order corresponding to the ordering of the coordinates of the selected point. A procedure to
generate a random linear extension of a partial order can be used repeatedly to approximate
the number of linear extensions of a partial order [18].

4.2  Application  to  Learning 

In this section we show how to use a fpras to implement a randomized version of the
approximate halving algorithm, and apply this result to the problem of learning a total order
on a set of $n$ elements.

Under the teacher-selected query sequence we obtain an $n - 1$ mistake bound. The teacher
can uniquely specify the target total order by giving the $n - 1$ instances that correspond to
consecutive elements in the target total order. Since $n - 1$ instances are needed to uniquely
specify a total order, we get a matching lower bound. Winkler [25] has shown that under the
learner-selected query sequence, one can also obtain an $n - 1$ mistake bound. To achieve this
bound the learner uses an insertion sort, as described by Cormen, Leiserson, and Rivest [6],
where for each new element the learner guesses it is smaller than each of the ordered elements
(starting with the largest) until a mistake is made. When a mistake occurs this new element
is properly positioned in the chain. Thus at most $n - 1$ mistakes will be made by the learner.
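
The insertion-sort learner can be simulated with the following Python sketch. The function `insertion_learner` and the callback `precedes`, which plays the role of the feedback after each prediction, are names assumed for this illustration.

```python
def insertion_learner(elements, precedes):
    """Simulate the learner-selected strategy and count its mistakes.

    elements -- the objects, in the order the learner chooses to insert them
    precedes -- precedes(a, b) is True iff a comes before b in the hidden
                target order (stands in for the feedback on each query)
    """
    chain = []        # objects ordered so far, smallest first
    mistakes = 0
    for x in elements:
        pos = len(chain)
        # Walk down from the largest element, always predicting "x is smaller".
        while pos > 0:
            if precedes(x, chain[pos - 1]):
                pos -= 1              # prediction was correct, keep going
            else:
                mistakes += 1         # first mistake pins down x's position
                break
        chain.insert(pos, x)
    return chain, mistakes
```

Each inserted element causes at most one mistake and the first element causes none, so the count never exceeds $n - 1$; for example, inserting `[3, 1, 4, 0, 2]` under the usual integer order yields the sorted chain with two mistakes.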
We now argue that the learner can be forced to make $n - 1$ mistakes. The adversary
gives feedback using the following simple strategy: the first time an object is involved in a
comparison, reply that the learner's prediction is wrong. In doing so, one creates a set of chains
where a chain is a total order on a subset of the elements. If $c$ chains are created by this
process then the learner has made $n - c$ mistakes. Since all these chains must be combined
to get a total order, the adversary can force $c - 1$ additional mistakes by always replying
that a mistake occurs the first time that elements from two different chains are compared.
(It is not hard to see that the above steps can be interleaved.) Thus the adversary can force
$n - 1$ mistakes.

Next we consider the case that an adversary selects the query sequence. We first prove
an $\Omega(n \lg n)$ lower bound on the number of mistakes made by any prediction algorithm. We
use the following result of Kahn and Saks [13]. Given any partial order $P$ that is not a total
order there exists an incomparable pair of elements $x_i, x_j$ such that

\[
\frac{3}{11} \le \frac{\text{number of extensions of } P \text{ with } x_i < x_j}{\text{number of extensions of } P} \le \frac{8}{11}.
\]

So the adversary can always pick a pair of elements so that regardless of the learner's
prediction, the adversary can report that a mistake was made while only eliminating a constant
fraction of the remaining total orders.

We  now  present  a  polynomial  prediction  algorithm  making  nlgn  +  (lge)lgn  mistakes 
with  very  high  probability.  We  first  show  how  to  use  an  exact  counting  algorithm  R ,  for 
counting  the  number  of  concepts  in  Cn  consistent  with  a  given  set  of  examples,  to  implement 
the  halving  algorithm. 

Lemma 6 Given a polynomial algorithm $R$ to exactly count the number of concepts in $C_n$
consistent with a given set $E$ of examples, one can construct an efficient implementation of
the halving algorithm for $C_n$.

Proof: We show how to use $R$ to efficiently make the predictions required by the halving
algorithm. To make a prediction for an instance $x$ in $X_n$ the following procedure is used:
Construct $E^-$ from $E$ by appending $x$ as a negative example to $E$. Use the counting algorithm
$R$ to count the number $C^-$ of concepts in $V$ that are consistent with $E^-$. Next construct $E^+$
from $E$ by appending $x$ as a positive example to $E$. As before, use $R$ to count the number
$C^+$ of concepts in $V$ that are consistent with $E^+$. Finally if $C^- > C^+$ then predict that $x$ is
a negative example, otherwise predict that $x$ is a positive example.

Clearly a prediction is made in polynomial time, since it just requires calling $R$ twice.
We claim that it predicts according to the majority of concepts in $V$. Note that $C^-$ is the
number of concepts in $V$ for which $x$ is a negative instance. Likewise, $C^+$ is the number
of concepts in $V$ for which $x$ is a positive instance. Thus it immediately follows that the
prediction agrees with the majority of concepts in $V$. ■
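
The procedure in this proof can be sketched in Python for the concept class of total orders. The brute-force counter below enumerates all $n!$ orders and is therefore only a stand-in for the polynomial counting algorithm $R$ assumed by the lemma; the names `count_consistent` and `halving_predict` are hypothetical.

```python
from itertools import permutations


def count_consistent(n, examples):
    """Count total orders on range(n) consistent with the labeled examples.

    examples -- list of ((i, j), label); label True means "i precedes j".
    Brute force over all n! orders, so usable only for small n.
    """
    count = 0
    for perm in permutations(range(n)):
        rank = {obj: pos for pos, obj in enumerate(perm)}
        if all((rank[i] < rank[j]) == label for (i, j), label in examples):
            count += 1
    return count


def halving_predict(n, examples, x, counter=count_consistent):
    """Predict the label of instance x by comparing the two counts."""
    c_neg = counter(n, examples + [(x, False)])   # C^-: x treated as negative
    c_pos = counter(n, examples + [(x, True)])    # C^+: x treated as positive
    return not (c_neg > c_pos)                    # negative only if C^- > C^+
```

With $n = 3$ and the single example "0 precedes 1", three orders remain; two of them place 0 before 2, so `halving_predict(3, [((0, 1), True)], (0, 2))` predicts positive.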

We modify this basic technique to use a fpras instead of the exact counting algorithm
to obtain an efficient implementation of a randomized version of the approximate halving
algorithm.

Theorem 12 Let $R$ be a fpras for counting the number of concepts in $C_n$ consistent with
a given set $E$ of examples. If $|X_n|$ is polynomial in $n$, one can produce a prediction
algorithm that for any $\delta > 0$ runs in time polynomial in $n$ and in $\lg(1/\delta)$ and makes at most
$\lg|C_n| \left(1 + \frac{\lg e}{n}\right)$ mistakes with probability at least $1 - \delta$.

Proof:  The  prediction  algorithm  implements  the  procedure  described  in  Lemma  6  with  the 
exact  counting  algorithm  replaced  by  the  fpras  R(n,  L,  jjJTf)-  Consider  the  prediction  for  an 
instance $x \in X_n$. Let $V$ be the set of concepts that are consistent with all previous instances.
Let $r^+$ (respectively $r^-$) be the number of concepts in $V$ for which $x$ is a positive (negative)
instance. Let $\hat{r}^+$ (respectively $\hat{r}^-$) be the estimate output by $R$ for $r^+$ ($r^-$). Since $R$ is a
fpras,  with  probability  at  least  1  — 


\[
\frac{r^-}{1 + \epsilon} \le \hat{r}^- \le (1 + \epsilon) r^- \quad \text{and} \quad \frac{r^+}{1 + \epsilon} \le \hat{r}^+ \le (1 + \epsilon) r^+.
\]

Without loss of generality, assume that the algorithm predicts that $x$ is a negative instance,
and thus $\hat{r}^- \ge \hat{r}^+$. Combining the above inequalities and the observation that $r^- + r^+ = |V|$,
we obtain that $r^- \ge \frac{|V|}{1 + (1 + \epsilon)^2}$.

We define an appropriate prediction to be a prediction that agrees with at least $\frac{|V|}{1 + (1 + \epsilon)^2}$
of the concepts in $V$. To analyze the mistake bound for this algorithm, suppose that each
prediction  is  appropriate.  For  a  single  prediction  to  be  appropriate,  both  calls  to  the  fpras  R 
must output a count that is within a factor of $1 + \epsilon$ of the true count. So any given prediction
is  appropriate  with  probability  at  least  1  —  -ry-i,  and  thus  the  probability  that  all  predictions 
are  appropriate  is  at  least 

Clearly if all predictions are appropriate then the above procedure is in fact an implementation
of the approximate halving algorithm with $\varphi = \frac{1}{1 + (1 + \epsilon)^2}$, and thus by Theorem 11 at
most $\log_{(1-\varphi)^{-1}} |C_n|$ mistakes are made. Substituting the chosen value of $\epsilon$ and simplifying
the expression we obtain that with probability at least $1 - \delta$,


#  mistakes  < 


1*10.1 
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Applying  the  inequalities  lg  (l  —  and  1  +  lg  (l  —  <  1  -  we  get  that 
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Finally,  applying  these  inequalities  to  Equation  14  yields  that 

lg|Cn| 


#  mistakes  < 


Is  (l  +  Jwfefl) 


<  lg  I C, 


Note that we could modify the above proof by not requiring that all predictions be
appropriate. In particular if we allow $\gamma$ predictions to not be appropriate then we get a
mistake bound of $\lg|C_n| \left(1 + \frac{\lg e}{n}\right) + \gamma$.

We  now  apply  this  result  to  obtain  the  main  result  of  this  section.  Namely,  we  describe 
a  randomized  polynomial  prediction  algorithm  for  learning  a  total  order  in  the  case  that  the 
adversary  selects  the  query  sequence. 

Theorem 13 There exists a prediction algorithm $A$ for learning total orders such that on
input $\delta$ (for all $\delta > 0$), and for any query sequence provided by the adversary, $A$ runs in
time polynomial in $n$ and $\lg(1/\delta)$ and makes at most $n \lg n + (\lg e) \lg n$ mistakes with probability
at least $1 - \delta$.

Proof Sketch: We apply the results of Theorem 12 using the fpras for counting the number
of extensions of a partial order given independently by Dyer, Frieze and Kannan [7], and by
Matthews [18]. We know that with probability at least $1 - \delta$, the number of mistakes is at
most $\lg|C_n| \left(1 + \frac{\lg e}{n}\right)$. Since $|C_n| = n!$ and $\lg n! \le n \lg n$, the desired result is obtained. ■

We  note  that  the  probability  that  A  makes  more  than  n  lg  n  +  (lg  e)  lg  n  mistakes  does 
not  depend  on  the  query  sequence  selected  by  the  adversary.  The  probability  is  taken  over 
the coin flips of the randomized approximation scheme.

Thus, as in learning a $k$-binary-relation using a row-filter algorithm, we see that a learner
can do asymptotically better with self-directed learning versus adversary-directed learning.
However, while the self-directed learning algorithm is deterministic, the adversary-directed
algorithm is randomized. To accommodate such randomized prediction algorithms we provide
the learner with an input $\delta$ and allow the algorithm to exceed the "worst-case" mistake
bound with probability $\delta$.


4.3  Majority  Algorithms  vs  Counting  Algorithms 


In this section we discuss the relationship between counting schemes and the halving
algorithm. Let $W$ be a set of elements for which some subset $S$ of the elements are distinguished.
We use the function $g$ that maps an element of $W$ to $\{0, 1\}$ to represent which of the elements
of $W$ are distinguished. Specifically, for $w \in W$, $g(w) = 1$ if and only if $w$ is a distinguished
element. Let $\Sigma$ be an alphabet whose words are used to describe subsets of $W$. Let $f$ be a function
from $\Sigma^* \to 2^W$ that maps $\sigma \in \Sigma^*$ to the subset of $W$ that it describes. We denote $f(\sigma)$ by
$V_\sigma$. Let $t_\sigma = |V_\sigma|$, and let $d_\sigma$ be the number of distinguished elements in $V_\sigma$. Formally, we
have $d_\sigma = |\{w \in V_\sigma : g(w) = 1\}|$. Finally, $u_\sigma = |V_\sigma| - d_\sigma$. That is, $u_\sigma$ is the number of
undistinguished elements in $V_\sigma$. So $d_\sigma + u_\sigma = t_\sigma$.

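A majority algorithm takes as input $\sigma$ and outputs a bit that is 1 if $d_\sigma \ge u_\sigma$, and 0 if
$d_\sigma < u_\sigma$. That is,

\[
\text{majority}(\sigma) = \begin{cases} 1 & \text{if } d_\sigma \ge u_\sigma \\ 0 & \text{if } d_\sigma < u_\sigma \end{cases}
\]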


On the other hand, a counting algorithm must output $d_\sigma$. In Lemma 6 we used a counting
algorithm to implement a majority algorithm. (There $V_\sigma = C_n$ and an element of $V_\sigma$ is
distinguished if and only if it is consistent with the given set of examples.) In this section
we discuss using a majority algorithm to implement a counting algorithm. The results of
this section are preliminary. Although we describe two techniques to convert a majority
algorithm to a counting algorithm, we do not have applications for these techniques that
yield any previously unknown results. We apply the first technique to an example problem;
however, it is trivial to construct a counting algorithm for this problem.

Figure 10: An illustration of the constrain oracle when the majority of the elements in
$V_\sigma$ are distinguished.


4.3.1  First  Approach 


In this section we describe our first approach for using a majority algorithm to implement a
counting algorithm. We then show that our algorithm can be used to convert an algorithm
that outputs whether a majority of $k$-CNF formulae classify a given instance as positive to one that
computes the number of $k$-CNF formulae that classify the given instance as positive.

We now describe our first approach. Here we recursively apply majority, at each step
reducing the larger of $d_\sigma$ and $u_\sigma$ by a factor of at least two. To use this approach, in addition
to the majority oracle, a constrain oracle must be provided. The specification for the
constrain oracle is as follows:


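$$\mathrm{constrain}(\sigma) = \begin{cases} \text{new-object}(d_\sigma - \lceil t_\sigma/2 \rceil,\; u_\sigma) & \text{if } d_\sigma \geq u_\sigma \text{ (i.e. } \mathrm{majority}(\sigma) = 1\text{)} \\ \text{new-object}(d_\sigma,\; u_\sigma - \lceil t_\sigma/2 \rceil) & \text{if } d_\sigma < u_\sigma \text{ (i.e. } \mathrm{majority}(\sigma) = 0\text{)} \end{cases}$$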


where new-object($d$, $u$) creates a word $\sigma' \in \Sigma^*$ such that $d_{\sigma'} = d$ and $u_{\sigma'} = u$. So the oracle
constrain just reduces the larger of $d_\sigma$ and $u_\sigma$ by $\lceil t_\sigma/2 \rceil$. The result of the
constrain oracle is illustrated in Figure 10.


Theorem 14 One can construct, from a majority and constrain oracle, a counting
algorithm that on input $\sigma$ uses at most $\lg(t_\sigma)$ calls to both oracles.

Proof: We construct a recursive counting procedure that takes input $\sigma$, and uses constrain
to reduce the larger of $d_\sigma$ and $u_\sigma$. The initial call should be Exact-Count1($\sigma$).

Exact-Count1($\sigma$)
1  if $t_\sigma = 0$
2     then return 0
3     else return $\lceil t_\sigma/2 \rceil \cdot \mathrm{majority}(\sigma)$ + Exact-Count1(constrain($\sigma$))

We first argue that this procedure is correct. Each time constrain is called, we know
that $d_\sigma$ is reduced by exactly $\lceil t_\sigma/2 \rceil \cdot \mathrm{majority}(\sigma)$. So the total decrease in $d_\sigma$ from its
original value is accumulated in line 3 of the procedure. Finally, when $t_\sigma = 0$ then clearly
$d_\sigma = 0$, thus the base case is correct.

Since at each step $t_\sigma$ is reduced to $\lfloor t_\sigma/2 \rfloor$, at most $\lg(t_\sigma)$ recursive calls are made.
Furthermore, majority and constrain are called only once for each recursive call. ■
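
The recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the original development: the class `Word` is a hypothetical stand-in for a word $\sigma$ that simply stores $d_\sigma$ and $u_\sigma$, the two oracles are toy implementations over it, ties are resolved as majority = 1, and the counting procedure is assumed to know $t_\sigma$ but never reads $d_\sigma$ or $u_\sigma$ directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for a word sigma; d and u are hidden from the counter."""
    d: int  # number of distinguished elements of V_sigma
    u: int  # number of undistinguished elements of V_sigma

    @property
    def t(self):
        # t_sigma = d_sigma + u_sigma, assumed known to the counting procedure
        return self.d + self.u


def majority(sigma):
    # Toy majority oracle: 1 if at least half of V_sigma is distinguished.
    return 1 if sigma.d >= sigma.u else 0


def constrain(sigma):
    # Toy constrain oracle: remove ceil(t/2) elements from the larger side.
    cut = (sigma.t + 1) // 2
    if sigma.d >= sigma.u:
        return Word(sigma.d - cut, sigma.u)
    return Word(sigma.d, sigma.u - cut)


def exact_count1(sigma):
    """Recover d_sigma using only sigma.t and the two oracles."""
    if sigma.t == 0:
        return 0
    half_up = (sigma.t + 1) // 2  # ceil(t_sigma / 2)
    return half_up * majority(sigma) + exact_count1(constrain(sigma))
```

For example, on `Word(7, 3)` the successive words are (7, 3), (2, 3), (2, 0), (1, 0) and (0, 0), and the accumulated sum is 5 + 0 + 1 + 1 = 7.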

In  terms  of  our  original  goal  of  converting  a  majority  algorithm  into  a  counting  algorithm 
we  have  the  following  corollary  of  Theorem  14. 

Corollary 3 Let $M$ be a majority algorithm for $W$ using $\Sigma^*$. Let $T_M(\sigma)$ be the running
time of $M$ on input $\sigma$. Furthermore, suppose that constrain can be implemented in time
$T_C(\sigma)$ on input $\sigma$. Then there exists an exact counting algorithm that runs in time at most
$(T_M(\sigma) + T_C(\sigma)) \lg t_\sigma$.

We now apply this conversion to the following problem. Given a boolean vector
$x \in \{0,1\}^n$, compute the number of k-CNF formulas over $n$ variables for which $x$ is a positive
example. As Angluin notes [2], Valiant's algorithm for PAC-learning k-CNF implements the
halving algorithm. That is, given an instance $x$, it replies that $x$ is a positive instance if
and only if at least half of the remaining k-CNF formulas classify $x$ as a positive example.
Furthermore, each prediction is made in polynomial time. Therefore, for the given counting
problem, we have the needed majority algorithm. We now apply Corollary 3 to k-CNF to
obtain the following result.

Lemma 7 There exists a polynomial time algorithm to exactly count the number of k-CNF
formulas for which some $x \in X_n$ is a positive instance.

Proof: Let $\sigma$ be selected from $\{0,1\}^n$ where the interpretation is that $\sigma$ gives the assignments
to the $n$ variables. So $V_\sigma$ is computed by evaluating the target formula on the assignment
given by $\sigma$ and including those that evaluate to 1. Thus Valiant's [23] algorithm for learning
k-CNF can serve as the majority algorithm. We now prove that the constrain oracle
can also be implemented in polynomial time. As Angluin [2] notes, if Valiant's algorithm
predicts 0, then there exists some clause $r$ in the learner's hypothesis that evaluates to 0.
So by removing $r$ the requirements of constrain are satisfied. ■

We note that this result is easily obtained without using Corollary 3. By studying the
recursive structure of the counting algorithm, we obtain the following counting algorithm for
k-CNF. Let $T$ be the number of possible clauses of size $k$ or less that are true for instance
$x$. Since the target formula can contain any subset of these clauses (but none of the clauses
that are negative for $x$), the number of k-CNF formulas that predict that $x$ is a positive
instance is $2^T$.
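
The direct count can be checked on small cases with the following Python sketch. It is illustrative only and rests on assumptions not fixed by the text: a clause is taken to be a nonempty set of at most $k$ literals over distinct variables (so tautological clauses are excluded), a formula is identified with its set of clauses, and no earlier examples restrict the set of formulas. The helper names are hypothetical.

```python
from itertools import combinations, product


def all_clauses(n, k):
    """Every clause with 1..k literals over distinct variables.

    A clause is a tuple of (variable index, required bit) pairs; the
    literal (v, 1) stands for x_v and (v, 0) for its negation.
    """
    clauses = []
    for size in range(1, k + 1):
        for variables in combinations(range(n), size):
            for bits in product((0, 1), repeat=size):
                clauses.append(tuple(zip(variables, bits)))
    return clauses


def clause_true(clause, x):
    # A clause is a disjunction: it holds if any literal agrees with x.
    return any(x[v] == b for v, b in clause)


def count_positive_formulas(x, k):
    """Return 2^T, where T is the number of clauses that are true for x."""
    t = sum(clause_true(c, x) for c in all_clauses(len(x), k))
    return 2 ** t


def count_by_enumeration(x, k):
    """Brute force over all sets of clauses; feasible only for tiny n and k."""
    clauses = all_clauses(len(x), k)
    count = 0
    for mask in range(2 ** len(clauses)):
        chosen = [c for j, c in enumerate(clauses) if (mask >> j) & 1]
        if all(clause_true(c, x) for c in chosen):
            count += 1
    return count
```

With $x = (1, 0)$ and $k = 2$ there are 8 clauses under these assumptions, 5 of which are true for $x$, so both functions return $2^5 = 32$.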

4.3.2  Second  Approach 

In this section we describe our second approach to using a majority algorithm to implement
a counting algorithm. To motivate this approach, we consider how the first approach might
fail. The algorithm Exact-Count1 can be used only if one can remove elements of $V_\sigma$ in a
controlled manner. However, it may be the case that one cannot remove elements from $V_\sigma$ as
desired, but can create some $\sigma'$ such that $V_{\sigma'} \supset V_\sigma$ and furthermore, all elements of $V_{\sigma'} - V_\sigma$
are distinguished (or undistinguished). Our second approach is based on these ideas; instead
of adjusting $d_\sigma$ and $u_\sigma$ by reducing the larger one, we achieve the same effect by
appropriately increasing the size of the smaller one.

To use this approach, in addition to the majority oracle, an expand oracle must be
provided instead of the constrain oracle. The specification of the expand oracle is as
follows:

$$\mathrm{expand}(\sigma, i) = \begin{cases} \text{new-object}(d_\sigma,\; u_\sigma + i) & \text{if } d_\sigma \geq u_\sigma \\ \text{new-object}(d_\sigma + i,\; u_\sigma) & \text{if } d_\sigma < u_\sigma \end{cases}$$

So the oracle expand just adds $i$ elements to the smaller of $d_\sigma$ and $u_\sigma$. The result of the
expand oracle is illustrated in Figure 11.

We  now  describe  the  details  of  the  second  approach  to  convert  a  majority  algorithm  to 
a  counting  algorithm. 

Theorem 15 One can construct, from a majority and an expand oracle, a counting
algorithm that on input $\sigma$ uses at most $\lg(t_\sigma)$ calls to both oracles.

Figure 11: An illustration of the expand oracle for the case that a majority of $V_\sigma$ is
distinguished.

Proof: As in Theorem 14, the basic idea is to use the majority oracle to compute $d_\sigma$. We
do this by increasing the smaller of $d_\sigma$ and $u_\sigma$ so that they are essentially equal. As before,
we use a binary search technique to efficiently perform this task, where at each step we use
a call to majority to determine in which direction further adjustments should be made.
The details are as follows.

Exact-Count2($\sigma$, $\beta$, $i$)
1  if $i = 0$
2     then if majority($\sigma$) = 1
3        then return $(t_\sigma + \beta)/2$
4        else return $(t_\sigma - \beta)/2$
5     else if majority(expand($\sigma$, $\beta$)) = majority($\sigma$)
6        then return Exact-Count2($\sigma$, $\beta + \lceil i/2 \rceil$, $\lfloor i/2 \rfloor$)
7        else return Exact-Count2($\sigma$, $\beta - \lceil i/2 \rceil$, $\lfloor i/2 \rfloor$)

The initial call should be Exact-Count2($\sigma$, 0, $t_\sigma$). Note that $\beta$ is the current estimate of the
number of elements that need to be added to the small side to make both sides equal, and $i$
is the size of the next adjustment to be made to $\beta$.
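
As a worked illustration of the search, consider a hypothetical input with $d_\sigma = 7$ and $u_\sigma = 3$, so that $t_\sigma = 10$ and majority($\sigma$) = 1. These values are chosen only for the example. Each row shows the arguments of one call, the pair $(d, u)$ of the word returned by expand($\sigma$, $\beta$), and the resulting adjustment.

```latex
\begin{array}{c|c|c|c|l}
\text{call} & \beta & i & (d, u) \text{ of } \mathrm{expand}(\sigma, \beta) & \text{action} \\ \hline
1 & 0 & 10 & (7, 3) & \text{majority agrees: } \beta \leftarrow 0 + \lceil 10/2 \rceil = 5 \\
2 & 5 & 5  & (7, 8) & \text{majority differs: } \beta \leftarrow 5 - \lceil 5/2 \rceil = 2 \\
3 & 2 & 2  & (7, 5) & \text{majority agrees: } \beta \leftarrow 2 + \lceil 2/2 \rceil = 3 \\
4 & 3 & 1  & (7, 6) & \text{majority agrees: } \beta \leftarrow 3 + \lceil 1/2 \rceil = 4 \\
5 & 4 & 0  & \text{not computed} & \text{return } (t_\sigma + \beta)/2 = (10 + 4)/2 = 7
\end{array}
```

The final value $\beta = 4$ equals $d_\sigma - u_\sigma$, and the returned value is $d_\sigma = 7$.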

We first argue that the procedure is correct. Let $\beta_l$, $i_l$ denote the input values of $\beta$ and $i$
to the $l$th call of Exact-Count2. Let $\sigma_l = \mathrm{expand}(\sigma, \beta_l)$. Finally, we use $d_l$ (respectively $u_l$)
to denote $d_{\sigma_l}$ (respectively $u_{\sigma_l}$). Without loss of generality, assume that $d_\sigma \geq u_\sigma$. So for all
$l$,

•  $d_l = d_0 = d_\sigma$,

•  $u_l = u_0 + \beta_l = u_\sigma + \beta_l$, and

•  $i_l = \lfloor i_{l-1}/2 \rfloor$.

We claim that when the input $i = 0$ in Exact-Count2, $d_l = u_l$. We prove this using the
following lemma.

Lemma 8 For all $l$, $|d_{\sigma_l} - u_{\sigma_l}| \leq i_l$.

Proof: We use an inductive proof on $l$. Clearly, the base case, $|d_\sigma - u_\sigma| \leq t_\sigma$, holds.

We now prove the inductive step. Assume inductively that $|d_l - u_l| = |d_\sigma - u_\sigma - \beta_l| \leq i_l$.
Our goal is to show that

$$|d_\sigma - u_\sigma - \beta_{l+1}| \leq i_{l+1} = \lfloor i_l/2 \rfloor.$$

We separately consider when $d_l > u_l$, $d_l < u_l$, and $d_l = u_l$.

•  Case 1: $d_l > u_l$. In this case, $\beta_{l+1} = \beta_l + \lceil i_l/2 \rceil$. So we must show that

$$-\lfloor i_l/2 \rfloor \leq d_\sigma - u_\sigma - \beta_l - \lceil i_l/2 \rceil \leq \lfloor i_l/2 \rfloor.$$

That is, we must show that:

$$\lceil i_l/2 \rceil - \lfloor i_l/2 \rfloor \leq d_\sigma - u_\sigma - \beta_l \leq \lfloor i_l/2 \rfloor + \lceil i_l/2 \rceil = i_l.$$

The inequality $d_\sigma - u_\sigma - \beta_l \leq i_l$ follows immediately from the inductive hypothesis.

For the other inequality note that $\lceil i_l/2 \rceil - \lfloor i_l/2 \rfloor \leq 1$. Furthermore, since $d_l > u_l$,

$$d_\sigma - u_\sigma - \beta_l = d_l - u_l \geq 1.$$

•  Case 2: $d_l < u_l$. In this case, $\beta_{l+1} = \beta_l - \lceil i_l/2 \rceil$. So we must show that

$$-\lfloor i_l/2 \rfloor \leq d_\sigma - u_\sigma - \beta_l + \lceil i_l/2 \rceil \leq \lfloor i_l/2 \rfloor.$$

The proof for these inequalities is similar to that of Case 1.

•  Case 3: $d_l = u_l$. For this case we first use an inductive proof to show that $d_l + u_l \equiv i_l$
(mod 2). The base case follows from the fact that $i_0 = d_0 + u_0$. For the inductive
step, observe that $i_l - i_{l+1} = |(d_l + u_l) - (d_{l+1} + u_{l+1})|$. Since $d_l = u_l$, it must be that
$i_l$ is divisible by two, and thus $\lfloor i_l/2 \rfloor = \lceil i_l/2 \rceil$. The inductive step of the lemma
then follows immediately.

This completes the proof of the lemma. ■

Since the recursion continues until $i_l = 0$ it follows from Lemma 8 that when $i = 0$
in Exact-Count2, $d_l = u_l$. Thus we know that for $\beta$ the final value of $\beta_l$, we have that
$d_\sigma = u_\sigma + \beta$. Applying the equality that $d_\sigma + u_\sigma = t_\sigma$ and solving for $d_\sigma$, we get that
$d_\sigma = (t_\sigma + \beta)/2$. (Since $d_l = u_l$, it follows that $t_\sigma + \beta$ is even.) The case where $d_\sigma < u_\sigma$ is
handled similarly. Thus we have shown that the output from Exact-Count2 is correct.

Since at each step the increment size $i$ is reduced to $\lfloor i/2 \rfloor$, at most $\lg(t_\sigma)$ recursive calls
are made. Furthermore each oracle is called at most once during each recursive call. ■
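
The second procedure admits a similar Python sketch, again only as an illustration under stated assumptions: `Word` is a hypothetical stand-in that stores $d_\sigma$ and $u_\sigma$, the oracles are toy implementations, ties are resolved as majority = 1, and the counting procedure is assumed to know $t_\sigma$ while learning about $d_\sigma$ only through oracle answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for a word sigma; d and u are hidden from the counter."""
    d: int  # number of distinguished elements of V_sigma
    u: int  # number of undistinguished elements of V_sigma

    @property
    def t(self):
        # t_sigma = d_sigma + u_sigma, assumed known to the counting procedure
        return self.d + self.u


def majority(sigma):
    # Toy majority oracle: 1 if at least half of V_sigma is distinguished.
    return 1 if sigma.d >= sigma.u else 0


def expand(sigma, i):
    # Toy expand oracle: add i elements to the smaller side of sigma.
    if sigma.d >= sigma.u:
        return Word(sigma.d, sigma.u + i)
    return Word(sigma.d + i, sigma.u)


def exact_count2(sigma, beta, i):
    """Binary search for beta = |d_sigma - u_sigma|, then solve for d_sigma."""
    if i == 0:
        if majority(sigma) == 1:
            return (sigma.t + beta) // 2
        return (sigma.t - beta) // 2
    step = (i + 1) // 2  # ceil(i / 2)
    if majority(expand(sigma, beta)) == majority(sigma):
        # beta elements are not yet enough to flip the majority: add more.
        return exact_count2(sigma, beta + step, i // 2)
    # beta elements overshoot: back off.
    return exact_count2(sigma, beta - step, i // 2)
```

A call such as `exact_count2(Word(7, 3), 0, 10)` follows the initial call Exact-Count2($\sigma$, 0, $t_\sigma$) and returns 7.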
In  terms  of  our  original  goal  of  converting  a  majority  algorithm  into  a  counting  algorithm 
we  have  the  following  corollary  of  Theorem  15. 

Corollary 4 Let $M$ be a majority algorithm for $W$ under $\Sigma^*$. Let $T_M$ be the running time
of $M$ on input $\sigma$. Furthermore, suppose that expand can be implemented in time $T_C$ on
input of size at most $t_\sigma$. Then there exists an exact counting algorithm that runs in time at
most $(T_M + T_C) \lg t_\sigma$.

5  Conclusions  and  Open  Problems 

We have studied the problem of learning a binary relation between two sets of objects and
between  a  set  and  itself  under  an  extension  of  the  on-line  learning  model.  We  have  presented 
general  techniques  to  help  develop  efficient  versions  of  the  halving  algorithm.  In  particular, 
we  have  shown  how  a  fpras  can  be  used  to  efficiently  implement  a  randomized  version  of  the 
approximate  halving  algorithm.  We  have  also  extended  the  mistake  bound  model  by  adding 
the  notion  of  an  instance  selector  and  generalizing  it  to  accommodate  randomized  learning 
algorithms.  The  specific  results  are  summarized  in  Table  1.  In  this  table  all  lower  bounds 
are  information-theoretic  bounds  and  all  upper  bounds  are  for  polynomial-time  learning 
algorithms.  Also,  unless  otherwise  stated,  the  results  listed  are  for  deterministic  learning 
algorithms. 
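
Concept Class | Director | Lower Bound | Upper Bound | Notes
Binary Relation (k row types) | Learner | $\frac{km}{2} + (n - \frac{k}{2})\lfloor \lg k - 1 \rfloor$ | $km + (n - k)\lfloor \lg k \rfloor$ |
Binary Relation (k row types) | Teacher | $km + (n - k)(k - 1)$ | $km + (n - k)(k - 1)$ |
Binary Relation (k row types) | Adversary | $km + (n - k)\lfloor \lg k \rfloor$ | $O(km + n\sqrt{m \lg k})$ * |
Binary Relation (k row types) | Adversary | $2m + n - 2$ | $2m + n - 2$ | $k = 2$
Binary Relation (k row types) | Adversary | $\Omega(km + (n - k)\lg k + \min\{n\sqrt{m}, m\sqrt{n}\})$ | $km + n\sqrt{(k - 1)m}$ | row-filter algorithm
Binary Relation (k row types) | Uniform Dist. | $\frac{km}{2} + (n - \frac{k}{2})\lfloor \lg k - 1 \rfloor$ | $O(km + nk\sqrt{H})$ | avg. case, row-filter alg.
Total Order | Teacher | $n - 1$ | $n - 1$ |
Total Order | Learner | $n - 1$ | $n - 1$ † |
Total Order | Adversary | $\Omega(n \lg n)$ | $n \lg n + (\lg e) \lg n$ | randomized algorithm

Table 1: Summary of our results.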

From observing Table 1 one can see that several of the above bounds are tight and several
others are asymptotically tight. However, for the problem of learning a k-binary-relation
there is a gap in the bound for the random and adversary (except $k \leq 2$) directors. Note
that the bounds for row-filter algorithms are asymptotically tight for $k$ constant. Clearly, if
we want asymptotically tight bounds that include a dependence on $k$ we must incorporate $k$
into the projective geometry lower bound. (Currently, the relation created by the adversary
has only two row types.)

For  the  problem  of  learning  a  total  order,  all  the  above  bounds  are  tight  or  asymptotically 
tight.  Although  the  fpras  for  approximating  the  number  of  extensions  of  a  partial  order  is  a 
polynomial-time  algorithm,  the  exponent  on  n  is  somewhat  large  and  the  algorithm  is  quite 
complicated.  Thus  an  interesting  problem  is  to  find  a  “practical”  prediction  algorithm  for 
the  problem  of  learning  a  total  order.  Another  interesting  direction  of  research  is  to  explore 
other  ways  of  modeling  the  structure  in  a  binary  relation.  Finally,  we  hope  to  find  other 
applications  of  fully  polynomial  approximation  schemes  to  learning  theory. 

* Due to Manfred Warmuth.

† Due to Peter Winkler.